paraphernalia.torch.clip module¶
Evaluate images with CLIP.
- class CLIP(prompt, anti_prompt=None, detail=None, use_tiling=True, macro=0.5, chops=64, model='ViT-B/32', device=None)[source]¶
A CLIP-based perceptor that evaluates how well an image fits one or more target text prompts.
The underlying model is limited to (224, 224) resolution, so this class presents it with multiple perspectives on an image:
Macro: random crops covering 90-100% of the image, used to counteract aliasing
Micro: small, near-pixel-perfect random crops, plus an optional tiling so that the fine details of high-resolution images can be processed.
Many internals are exposed via methods to facilitate debugging and experimentation.
- Parameters:
prompt (Union[str, List[str]]) – the text prompt to use in general
anti_prompt (Optional[Union[str, List[str]]]) – a description to avoid
detail (Optional[Union[str, List[str]]]) – a text prompt to use for micro-perception, defaults to “A detail from a picture of {prompt}”
use_tiling (bool) – if true, add a covering of near-pixel-perfect perceptors into the mix
macro (float) –
chops (int) – the number of augmentation operations, split 50-50 between macro and micro
model (str) – the name of the CLIP model to load, defaults to 'ViT-B/32'
device (Optional[str]) – the device on which to run the model
- use_tiling¶
If true, add a covering of near-pixel-perfect perceptors into the mix.
- Type:
bool
- chops¶
The number of augmentation operations; these are split 50-50 between macro and micro.
- Type:
int
Initializes internal Module state, shared by both nn.Module and ScriptModule.
- encode_text(text_or_texts)[source]¶
Encode text.
Returns a detached tensor.
- Parameters:
text_or_texts (str) – the text or texts to encode
- Return type:
Tensor
- encode_image(batch)[source]¶
Encode an image.
Does not detach.
- Parameters:
batch (Tensor) – an image batch
- Return type:
Tensor
- get_macro(img)[source]¶
Get a set of high-level views on an image batch.
- Parameters:
img (Tensor) – A (b, c, h, w) image batch
- Returns:
an expanded (b, c, h, w) image batch
- Return type:
Tensor
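The macro views described above (random crops covering 90-100% of the image) can be sketched as follows. This is an illustrative reconstruction, not the actual implementation: the function name `macro_crop_box` and the exact sampling scheme are assumptions, and the real method operates on torch image batches rather than bare coordinates.

```python
import random


def macro_crop_box(h, w, low=0.9, high=1.0):
    """Pick a random crop box covering 90-100% of an (h, w) image.

    Hypothetical sketch of one "macro" view; the real get_macro()
    produces an expanded batch of such crops, resized for the model.
    """
    scale = random.uniform(low, high)
    ch, cw = int(h * scale), int(w * scale)
    top = random.randint(0, h - ch)
    left = random.randint(0, w - cw)
    return top, left, ch, cw
```

Because each crop keeps at least 90% of the image, the macro views preserve global composition while jittering the sampling grid, which is what counteracts aliasing.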
- get_micro(img)[source]¶
Get a set of detailed (near pixel-perfect) views on an image batch.
- Parameters:
img (Tensor) – A (b, c, h, w) image batch
- Returns:
an expanded (b, c, h, w) image batch
- Return type:
Tensor
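The optional tiling behind the micro views can be sketched as a covering of the image by model-resolution (224 × 224) windows. This is a hypothetical layout: the helper name and the edge handling (shifting the last row and column back inside the image) are assumptions, not the library's actual code.

```python
def tile_origins(h, w, tile=224):
    """Top-left corners of a covering of an (h, w) image by tile-sized windows.

    Sketch of the assumed tiling for micro views: tiles sit on a grid,
    and the final row/column is pulled back so every tile stays in bounds.
    """
    def starts(size):
        if size <= tile:
            return [0]
        xs = list(range(0, size - tile, tile))
        xs.append(size - tile)  # last window flush with the far edge
        return xs

    return [(y, x) for y in starts(h) for x in starts(w)]
```

Each origin yields one near-pixel-perfect 224 × 224 view, so a high-resolution image contributes several micro views to the expanded batch.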
- get_similarity(img, prompts, batch_size, match='all')[source]¶
Compute the average similarity between a combined-but-contiguous batch of images and a set of prompts.
- Parameters:
img (Tensor) – A combined-but-contiguous image batch with shape (batch_size * t, c, h, w)
prompts (Tensor) – A tensor of prompt embeddings with shape (n, 512)
batch_size (int) – The size of the original image batch
match (str) – Policy for multiple prompts. “any”, “all” or (in future) “one”
- Returns:
A tensor of average similarities with shape (batch_size,)
- Return type:
Tensor
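The averaging over the combined-but-contiguous batch can be illustrated in plain Python. Here `sims` stands in for per-view similarity scores, and the assumed layout is that the t views of each original image occupy contiguous rows, which is what the (batch_size * t, c, h, w) shape suggests; the real method works on torch tensors and also applies the match policy across prompts.

```python
def average_similarity(sims, batch_size):
    """Average per-view similarities back to one score per original image.

    sims: a flat list of length batch_size * t, with the t views of
    each original image stored contiguously (an assumed layout).
    Returns a list of length batch_size.
    """
    t = len(sims) // batch_size
    return [sum(sims[i * t:(i + 1) * t]) / t for i in range(batch_size)]
```

For example, with batch_size=2 and two views per image, `average_similarity([0.1, 0.3, 0.2, 0.4], 2)` collapses the four view scores into one mean score per original image.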