**Cybersecurity & cyberwarfare** @cybersecurity@poliverso.org · 5h

Artificial Intelligence: Implementing the Attention Mechanism in Python

The attention mechanism is often associated with the transformer architecture, but it was already being used in RNNs (recurrent neural networks).

Take machine translation (for example, English to Italian). To predict the next Italian word, the model needs to focus on, or pay attention to, the English words in the input that matter most for a good translation.

I won't go into the details of RNNs, but attention helped these models mitigate the vanishing gradient problem and capture longer-range dependencies between words.

At some point, we realized that the only thing that really mattered was the attention mechanism and that the rest of the RNN architecture was superfluous. Hence, Attention is All You Need!

Self-Attention in Transformers

Classic attention tells the words of the output sequence which words of the input sequence to focus on. It is important in sequence-to-sequence tasks such as machine translation.

Self-attention is a specific type of attention. It operates between any two elements of the same sequence and tells us how “related” the words in the same sentence are.

For a given token (or word) in a sequence, self-attention produces a list of attention weights, one for each token of the sequence. Repeating this for every token of the sentence yields a matrix of attention weights (as in the figure).

That's the general idea. In practice things are a bit more complicated, because we want to add many parameters/weights to our network so that the model has more learning capacity.

The K, V, Q representations

The input to our model is a sentence such as “my name is Marcello Politi”. Tokenization converts the sentence into a list of numbers such as [2, 6, 8, 3, 1].
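A minimal sketch of this step follows. It assumes a toy, word-level vocabulary chosen only to reproduce the ids above. Real tokenizers use much larger, usually subword-level, vocabularies, and `vocab` and `tokenize` are hypothetical names:

```python
# Hypothetical toy vocabulary (word -> id), chosen only to reproduce the ids in the text.
vocab = {"politi": 1, "my": 2, "marcello": 3, "name": 6, "is": 8}


def tokenize(sentence: str) -> list[int]:
    """Lowercase the sentence, split on whitespace, and look up each word's id."""
    return [vocab[word] for word in sentence.lower().split()]


ids = tokenize("my name is Marcello Politi")
print(ids)  # [2, 6, 8, 3, 1]
```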

Before passing the sentence to the transformer, we need to create a dense representation for each token.

How do we create this representation? We multiply each token (written as a one-hot vector) by a matrix, which is learned during training.
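To make "multiply each token by a matrix" concrete: if a token id is written as a one-hot vector, multiplying it by the embedding matrix simply selects one row. This is a dependency-free sketch with a made-up 4x3 matrix. The names `E`, `one_hot` and `vec_mat` are ours:

```python
# Made-up embedding matrix E: vocab_size = 4 rows, embedding dim = 3 columns.
# In a real model these numbers are learned during training.
E = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]


def one_hot(token_id: int, vocab_size: int) -> list[float]:
    """Vector of zeros with a single 1.0 at position token_id."""
    return [1.0 if i == token_id else 0.0 for i in range(vocab_size)]


def vec_mat(x: list[float], m: list[list[float]]) -> list[float]:
    """Row vector x (length n) times matrix m (n x d) -> vector of length d."""
    return [sum(x[i] * m[i][j] for i in range(len(x))) for j in range(len(m[0]))]


token_id = 2
dense = vec_mat(one_hot(token_id, len(E)), E)
print(dense == E[token_id])  # True: the product just picks row `token_id` of E
```

This is why embedding layers such as `torch.nn.Embedding` are implemented as a table lookup rather than an explicit matrix product.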

Now let's add some complexity.

For each token we create 3 vectors instead of one: the key (K), value (V) and query (Q) vectors. (We'll see later how these 3 vectors are created.)

Conceptually, each of these 3 vectors has a particular meaning:

- The key vector represents the main information captured by the token.
- The value vector captures the complete information of a token.
- The query vector is a question about the relevance of the token to the current task.

The idea is that we focus on a particular token i and ask how important the other tokens of the sentence are with respect to token i.

This means we take the vector q_i for token i (we ask a question about i) and perform some mathematical operations with the keys k_j of the tokens in the sentence. In standard self-attention this includes j = i, the token itself. It's as if we were asking, at first glance, which tokens of the sequence seem really important for understanding the meaning of token i.

But what is this magic operation?

We take the dot product of the query vector with each key vector and divide by a normalization factor, the square root of the vector dimension d. This is done for every token k_j.

This gives us a score for each pair (q_i, k_j). We turn these scores into a probability distribution by applying a softmax to them. Great, now we have the attention weights!
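Written out, the score and the weight are shown below. The normalization factor is taken to be √d, as the code later in the article does (`d**0.5`). Here d is the dimension of the query and key vectors, and n is the number of tokens in the sentence:

$$
s_{ij} = \frac{q_i \cdot k_j}{\sqrt{d}}, \qquad a_{ij} = \frac{\exp(s_{ij})}{\sum_{m=1}^{n} \exp(s_{im})}
$$

By construction, each a_ij lies between 0 and 1, and the weights of token i sum to 1 over j = 1, …, n.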

With the attention weights, we know how important each token j is for interpreting token i. So now we multiply the value vector v_j associated with each token by its weight and sum the resulting vectors. This gives us the final, context-aware vector of token i.

If we are computing the contextual dense vector of token_1, we compute:

$$ z_1 = a_{11} v_1 + a_{12} v_2 + \dots + a_{15} v_5 $$

Here a_1j are the attention weights we just computed and v_j are the value vectors.
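The three steps above (scaled dot products, softmax, weighted sum of the values) fit in a few lines. This is a dependency-free sketch for a single query. The function names `softmax` and `attend` are ours, not a library API. Subtracting the maximum score before exponentiating is a standard numerical-stability trick that does not change the result:

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtracting the max avoids overflow; the result is unchanged
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q: list[float], keys: list[list[float]], values: list[list[float]]) -> list[float]:
    """Context vector for one query: z = sum_j a_j * v_j, with a = softmax(q . k_j / sqrt(d))."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    return [sum(a * v[c] for a, v in zip(weights, values)) for c in range(len(values[0]))]


# Toy check: identical keys give equal weights, so z is the plain average of the values.
z = attend([1.0, 0.0], keys=[[1.0, 1.0]] * 3, values=[[3.0, 0.0], [0.0, 3.0], [0.0, 0.0]])
print(z)  # [1.0, 1.0], up to floating-point rounding
```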

Done! Almost…

I haven't explained how we obtained the k, v and q vectors of each token. We need to define some matrices w_k, w_v and w_q such that when we multiply:

token * w_k -> k
token * w_q -> q
token * w_v -> v

These three matrices are initialized randomly and learned during training. This is one of the reasons modern models such as LLMs have so many parameters.
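The same computation, done for every token at once, reduces to a few matrix products. This dependency-free sketch uses the row-vector convention `token * w_k -> k` from the text. The helper names are ours, and the random matrices only stand in for weights that a real model would learn:

```python
import math
import random


def matmul(a: list[list[float]], b: list[list[float]]) -> list[list[float]]:
    """Product of an (n x m) and an (m x p) matrix, both given as lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m: list[list[float]]) -> list[list[float]]:
    return [list(col) for col in zip(*m)]


def softmax_rows(m: list[list[float]]) -> list[list[float]]:
    """Apply a numerically stable softmax to each row independently."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(x - mx) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention; each row of x is one token's embedding."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)  # token * w -> q, k, v
    d = len(k[0])
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, transpose(k))]
    weights = softmax_rows(scores)  # row i holds a_i1 ... a_in
    return weights, matmul(weights, v)  # row i is the context vector z_i


def random_matrix(rows: int, cols: int) -> list[list[float]]:
    return [[random.random() for _ in range(cols)] for _ in range(rows)]


random.seed(0)
x = random_matrix(5, 4)  # stand-in for the embeddings of 5 tokens, d = 4
w_q, w_k, w_v = random_matrix(4, 4), random_matrix(4, 4), random_matrix(4, 4)
weights, z = self_attention(x, w_q, w_k, w_v)
print(len(z), len(z[0]))  # 5 4
```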

Multi-Head Self-Attention (MHSA) in Transformers

Are we sure that the self-attention mechanism above can capture all the important relationships between tokens (words) and create dense vectors for those tokens that really make sense?

In reality, it may not always work perfectly. What if, to mitigate the error, we ran the whole operation twice with new matrices w_q, w_k and w_v and somehow merged the two resulting dense vectors? That way, perhaps one self-attention would capture some relationships and the other would capture others.

Well, this is exactly what happens in MHSA. The case we just discussed has two heads, because it has two sets of matrices w_q, w_k and w_v. We can also have more heads: 4, 8, 16, and so on.

The only complicated part is that all these heads are handled in parallel: they are all processed in the same computation using tensors.

Merging the dense vectors of the heads is simple: we concatenate them and pass the result through another learnable matrix w_o. Each head's vector must therefore be smaller than the final one. With h heads and model dimension d, each head produces vectors of size d/h, so concatenating them gives back the original dimension d.
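Here is a sketch of the split-and-merge step. It is dependency-free and the helper names are ours. Real frameworks batch all heads into one tensor operation, as the text notes; here each head is computed in a plain loop for readability. With d = 8 and h = 2 heads, each head produces d_k = d / h = 4 values per token, and concatenating them restores width 8 before the final projection w_o:

```python
import math
import random


def matmul(a, b):
    """Matrix product of lists of rows: (n x m) @ (m x p) -> (n x p)."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def attention(q, k, v):
    """Scaled dot-product attention for one head: softmax(q k^T / sqrt(d_k)) v, row by row."""
    d_k = len(k[0])
    out = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in k]
        mx = max(scores)
        exps = [math.exp(s - mx) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v_row[c] for w, v_row in zip(weights, v)) for c in range(len(v[0]))])
    return out


def multi_head_attention(x, heads, w_o):
    """heads: list of (w_q, w_k, w_v), each d x d_k; w_o: d x d output projection."""
    head_outputs = [attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
                    for w_q, w_k, w_v in heads]
    # For each token, concatenate the h head vectors of size d_k into one vector of size d.
    concat = [[val for head in head_outputs for val in head[i]] for i in range(len(x))]
    return matmul(concat, w_o)


def random_matrix(rows, cols):
    return [[random.random() for _ in range(cols)] for _ in range(rows)]


random.seed(0)
n_tokens, d, h = 5, 8, 2
d_k = d // h  # 4 values per head, so h * d_k == d after concatenation
x = random_matrix(n_tokens, d)
heads = [tuple(random_matrix(d, d_k) for _ in range(3)) for _ in range(h)]
w_o = random_matrix(d, d)
y = multi_head_attention(x, heads, w_o)
print(len(y), len(y[0]))  # 5 8
```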

Hands-on

Suppose we have a sentence. After tokenization, each token (or word) corresponds to an index (a number):

```python
import torch

tokenized_sentence = torch.tensor([
    2,  # my
    6,  # name
    8,  # is
    3,  # marcello
    1,  # politi
])
tokenized_sentence
```

Before passing the sentence to the transformer, we need to create a dense representation for each token.

How do we create this representation? We multiply each token by a matrix, which is learned during training.

Let's build this matrix, called the embedding matrix.

```python
torch.manual_seed(0)  # set a fixed seed for reproducibility
embed = torch.nn.Embedding(10, 16)  # vocabulary of 10 tokens, 16-dimensional embeddings
```

Passing our tokenized sentence through the embedding layer, which is equivalent to multiplying each one-hot-encoded token by the embedding matrix, gives us a dense representation of size 16 for each token:

```python
sentence_embed = embed(tokenized_sentence).detach()  # shape [5, 16]
sentence_embed
```

To use the attention mechanism, we need to create 3 new matrices w_q, w_k and w_v. Multiplying an input token by w_q gives us the vector q, and the same goes for w_k and w_v.

```python
d = sentence_embed.shape[1]  # embedding dimension (16), so each matrix is 16 x 16

w_key = torch.rand(d, d)
w_query = torch.rand(d, d)
w_value = torch.rand(d, d)
```

Computing the attention weights

Let's now compute the attention weights for the first token of the sentence only.

```python
token1_embed = sentence_embed[0]

# compute the three vectors associated with token1: q, k, v
key_1 = w_key.matmul(token1_embed)
query_1 = w_query.matmul(token1_embed)
value_1 = w_value.matmul(token1_embed)

print("key vector for token1: \n", key_1)
print("query vector for token1: \n", query_1)
print("value vector for token1: \n", value_1)
```

We need to multiply the query vector associated with token1 (query_1) by the key vectors of all the tokens, token1 included.

So now we need to compute all the keys, key_1 through key_5. But wait, we can compute them all at once by multiplying sentence_embed by the transpose of w_k:

```python
keys = sentence_embed.matmul(w_key.T)
keys[0]  # contains the key vector of the first token, and so on
```
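Why does `sentence_embed.matmul(w_key.T)` give the same keys as computing `w_key.matmul(token1_embed)` token by token? Row i of X·Wᵀ is exactly W·x_i. Here is a dependency-free check with small made-up numbers. The function names are ours:

```python
def mat_vec(w, x):
    """w @ x: each row of w dotted with the vector x (like w_key.matmul(token1_embed))."""
    return [sum(w_rj * x_j for w_rj, x_j in zip(row, x)) for row in w]


def rows_times_transpose(xs, w):
    """xs @ w.T: each token row dotted with each row of w (like sentence_embed.matmul(w_key.T))."""
    return [[sum(x_j * w_rj for x_j, w_rj in zip(x, row)) for row in w] for x in xs]


w_key = [[1.0, 2.0], [3.0, 4.0]]      # made-up 2 x 2 key matrix
sentence = [[1.0, 0.0], [0.5, -1.0]]  # made-up embeddings of 2 tokens
keys = rows_times_transpose(sentence, w_key)
print(keys[0] == mat_vec(w_key, sentence[0]))  # True
```

In the torch code above, the same check is `torch.allclose(keys[0], key_1)`. We use `allclose` rather than `==` because the two computations may round differently in floating point.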

Let's do the same with the values:

```python
values = sentence_embed.matmul(w_value.T)
values[0]  # contains the value vector of the first token, and so on
```

Now let's compute the first part of the formula, the attention weights of token1:

```python
import torch.nn.functional as F

# attention weights of the first token with respect to all tokens (itself included)
a1 = F.softmax(query_1.matmul(keys.T) / d**0.5, dim=0)
a1
```

With the attention weights we know how important each token is. So now we multiply the value vector associated with each token by its weight and sum the results. This gives the final vector of token_1, which now also includes the context:

```python
z1 = a1.matmul(values)  # weighted sum of the value vectors
z1
```

Similarly, we can compute the context-aware dense vectors of all the other tokens. So far we have always used the same matrices w_k, w_q and w_v, which means we are using a single head.
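Stack the query, key and value vectors of all tokens as the rows of matrices Q, K and V. The per-token computation above then collapses into a single expression, with the softmax applied row by row. The i-th row of Z is the context vector z_i:

$$
Z = \operatorname{softmax}\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V
$$

With the notebook's variables this would be `F.softmax(queries.matmul(keys.T) / d**0.5, dim=-1).matmul(values)`, assuming `queries = sentence_embed.matmul(w_query.T)` is defined the same way as `keys`. The article's single-head code only computes `query_1`. Row 0 of the result equals `z1`.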

But we can have several triplets of matrices, that is, multiple heads. That's why it's called multi-head attention.

The dense vectors that each head produces for an input token are then concatenated and linearly transformed to obtain the final dense vector:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)  # re-seed so the embeddings match the ones above

# Tokenized sentence (same as above)
tokenized_sentence = torch.tensor([2, 6, 8, 3, 1])  # [my, name, is, marcello, politi]

# Embedding layer: vocab size = 10, embedding dim = 16
embed = nn.Embedding(10, 16)
sentence_embed = embed(tokenized_sentence).detach()  # Shape: [5, 16] (seq_len, embed_dim)

d = sentence_embed.shape[1]  # embedding dimension: 16
h = 4  # number of heads
d_k = d // h  # dimension per head (16 / 4 = 4)

# Define weight matrices for each head
w_query = torch.rand(h, d, d_k)  # Shape: [4, 16, 4] (one d x d_k matrix per head)
w_key = torch.rand(h, d, d_k)  # Shape: [4, 16, 4]
w_value = torch.rand(h, d, d_k)  # Shape: [4, 16, 4]
w_output = torch.rand(d, d)  # final linear layer: [16, 16]

# Compute Q, K, V for all tokens and all heads
# sentence_embed: [5, 16] -> Q, K, V: [4, 5, 4] (h, seq_len, d_k)
queries = torch.einsum('sd,hde->hse', sentence_embed, w_query)  # [h, seq_len, d_k]
keys = torch.einsum('sd,hde->hse', sentence_embed, w_key)  # [h, seq_len, d_k]
values = torch.einsum('sd,hde->hse', sentence_embed, w_value)  # [h, seq_len, d_k]

# Compute scaled attention scores for every head
scores = torch.einsum('hse,hek->hsk', queries, keys.transpose(-2, -1)) / (d_k ** 0.5)  # [4, 5, 5]
attention_weights = F.softmax(scores, dim=-1)  # [4, 5, 5], each row sums to 1

# Apply attention weights to the values
head_outputs = torch.einsum('hij,hjk->hik', attention_weights, values)  # [4, 5, 4]
head_outputs.shape

# Concatenate heads: [h, seq_len, d_k] -> [seq_len, h * d_k] = [5, 16]
concat_heads = head_outputs.permute(1, 0, 2).reshape(sentence_embed.shape[0], -1)  # [5, 16]
concat_heads.shape

multihead_output = concat_heads.matmul(w_output)  # [5, 16] @ [16, 16] -> [5, 16]
print("Multi-head attention output for token1:\n", multihead_output[0])
```

Conclusions

In this post I implemented a simple version of the attention mechanism. This is not how it is actually implemented in modern frameworks, but my goal is to give some insight so that anyone can understand how it works. In upcoming articles I'll go through the full implementation of a transformer architecture.

The article Intelligenza Artificiale: Implementazione del meccanismo dell’attenzione in Python originally appeared on il blog della sicurezza informatica.

**LLMs** @LLMs@activitypub.awakari.com · 2d

NeuReality Announces Inference Appliance Is Preloaded with AI Models Caesarea, Israel – May 14,...

https://insidehpc.com/2025/05/neureality-announces-inference-appliance-is-preloaded-with-ai-models/

#Compute #Machine #Learning #News #AI #AI #inference #artificial #intelligence #inference #NeuReality

**Kubernetes** @Kubernetes@activitypub.awakari.com · 3d

New Amazon EC2 P6-B200 instances powered by NVIDIA Blackwell GPUs to accelerate AI innovations Th...

https://aws.amazon.com/blogs/aws/new-amazon-ec2-p6-b200-instances-powered-by-nvidia-blackwell-gpus-to-accelerate-ai-innovations/

#Amazon #EC2 #Compute #Featured #Launch #News

Amazon Web Services · 3d: New Amazon EC2 P6-B200 instances powered by NVIDIA Blackwell GPUs to accelerate AI innovations | Amazon Web Services. The P6-B200 EC2 instances powered by NVIDIA Blackwell B200 GPUs offer up to twice the performance of previous P5en instances for machine learning and high-performance computing workloads.

**linux** @linux@activitypub.awakari.com · 4d

[Launched] Generally Available: App Service Webjobs on Linux WebJobs allow for the execution of b...

https://azure.microsoft.com/updates?id=492316

#Azure #Updates #app #service #compute #Features #Launched #Mobile #Web

azure.microsoft.com: Azure updates | Microsoft Azure. Subscribe to Microsoft Azure today for service updates, all in one place. Check out the new Cloud Platform roadmap to see our latest product plans.

**OpenSource** @OpenSource@activitypub.awakari.com · May 5

Unlock what’s next: Microsoft at Red Hat Summit 2025 Learn more about the solutions that Micros...

https://azure.microsoft.com/en-us/blog/unlock-whats-next-microsoft-at-red-hat-summit-2025/

#Compute #Containers #Management #and #governance #Migration

**Nasdaq** @Nasdaq@activitypub.awakari.com · May 6

Penguin Solutions Signs AI Infrastructure Deal with CDW Milpitas, Calif. – May 6, 2025 – Pe...

https://insidehpc.com/2025/05/penguin-solutions-signs-ai-infrastructure-deal-with-cdw/

#Compute #Data #Center #Machine #Learning #News #AI #infrastructure #CDW #HPC-AI #infrastructure

#hpcai

**OpenSource** @OpenSource@activitypub.awakari.com · May 2

AI Inference: Meta Collaborates with Cerebras on Llama API Sunnyvale, CA — Meta has teamed with...

https://insidehpc.com/2025/05/ai-inference-meta-collaborates-with-cerebras-on-llama-api/

#Compute #CPUs, #GPUs, #FPGAs #Machine #Learning #News #AI #compute #AI #inference

#CPUs #gpus

**LLMs** @LLMs@activitypub.awakari.com · Apr 28

Accelerate AI innovation and business transformation: Scaling AI transformation with strategic cl...

https://www.microsoft.com/en-us/microsoft-cloud/blog/2025/04/28/accelerate-ai-innovation-and-business-transformation-scaling-ai-transformation-with-strategic-cloud-partnership/

#Compute #Containers

**Hacker News** @h4ckernews@mastodon.social · Apr 21

The Future of Compute: Nvidia's Crown Is Slipping

https://mohitdagarwal.substack.com/p/from-dominance-to-dilemma-nvidia

Small Fish Big Pond · Oct 25, 2024: The Future of Compute: NVIDIA's Crown is Slipping, by Mohit Agarwal

#HackerNews #Nvidia #Future

♬ @peterrenshaw@ioc.exchange · Apr 17

Day 19 cont

“He (#PeterDutton) cites #DataCentres in the US where those #tech companies are having conversations with nuclear power providers:

The beauty of an #investment like #nuclear into the #Hunter region for example is you can attract the data centres which is exactly what is happening in the US. #Apple and #Oracle and #Microsoft, or these #companies are willing to spend tens of billions of dollars but they are only having conversations with #NuclearPower providers.”

#Straya gov cant #science or #compute, the LNP are garbage at business. Nuclear generation is #toxic. #Multinationals avoid tax.

#AusPol / #LNP / #Iberal / #Nationals / #Business / #AI / #ArtificialIntelligence <https://www.theguardian.com/australia-news/live/2025/apr/17/australia-election-2025-live-peter-dutton-anthony-albanese-coalition-labor-income-tax-cost-of-living-leaders-debate-ntwnfb?page=with%3Ablock-68006d1c8f08bcf9ff4832be#block-68006d1c8f08bcf9ff4832be>

the Guardian · Apr 17: Australia news live: Turnbull says negative gearing ‘examined by every government’; measles warning issued for greater Melbourne area. Follow today’s news live

**Raspberry-Pi** @Raspberry-Pi@activitypub.awakari.com · Apr 16

Cerebro clusterboard supports up to four Raspberry Pi, NVIDIA Jetson, or Radxa CM5 compute module...

https://liliputing.com/cerebro-clusterboard-supports-up-to-four-raspberry-pi-nvidia-jetson-or-radxa-cm5-compute-modules-crowdfunding/

#News #cerebro #clusterboard #compute #module #crowdfunding #radxa #cm5 #raspberry #pi #cm4

Liliputing · Apr 16: Cerebro clusterboard supports up to four Raspberry Pi, NVIDIA Jetson, or Radxa CM5 compute modules (crowdfunding) - Liliputing

**LLMs** @LLMs@activitypub.awakari.com · Apr 15

OpenAI Says Rebuilding GPT-4 Now Takes Just 5 To 10 People Creating GPT-4 was once an all-hands-o...

https://wonderfulengineering.com/openai-says-rebuilding-gpt-4-now-takes-just-5-to-10-people/

#News #Technology #AI #development #compute #efficiency #data #bottleneck #GPT-4 #GPT-4.5 #OpenAI

Wonderful Engineering · Apr 15: OpenAI Says Rebuilding GPT-4 Now Takes Just 5 To 10 People. Creating GPT-4 was once an all-hands-on-deck operation at OpenAI, involving hundreds of engineers and researchers. But now, thanks to advancements mad…

#gpt4 #gpt45

**LLMs** @LLMs@activitypub.awakari.com · Jun 29, 2023

The Data Center is the New VC The recent Inflection AI fundraising news confirmed a hypothesis I ...

https://deliprao.com/2023/06/the-data-center-is-the-new-vc/

#Essay #compute #governance #GPUs

deliprao.com: The Data Center is the New VC | Delip Rao

**Habr** @habr@zhub.link · Apr 10

Declarative API, behavior trees and reconciliation: how we at MWS are building the Compute service

Hello everyone! Родион Цалкин here, Tech Product IaaS at MWS. In this article I'll explain which high-level solutions make up the heart of MWS, the Compute service for computing resources. I'll also show how knowledge from different fields helps find elegant solutions to the problems that arise while building it. There won't be a technical deep dive here (stay tuned for future articles), so the article should interest a wide range of readers.

https://habr.com/ru/companies/mws/articles/899288/

#cloud #compute #виртуализация