Building a RAG Chatbot with SageMaker, Streamlit, and OpenSearch 2. Testing KoSimCSE-RoBERTa in SageMaker Studio

Hyeon Cloud 2023. 11. 6. 08:59

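import torch
from transformers import AutoModel, AutoTokenizer

# Load the pretrained Korean sentence-embedding model and its tokenizer.
model = AutoModel.from_pretrained('BM-K/KoSimCSE-roberta')
tokenizer = AutoTokenizer.from_pretrained('BM-K/KoSimCSE-roberta')

sample = "나는 김현민"  # "I am Kim Hyeon-min"
inputs = tokenizer(sample, padding=True, truncation=True, return_tensors="pt")
# With return_dict=False the model returns a tuple; the first item is the
# last hidden state with shape (batch, sequence_length, hidden_size).
embeddings, _ = model(**inputs, return_dict=False)
# The first token's vector of the first sentence serves as the sentence embedding.
emb_len = len(embeddings[0][0])
print("Sample Sentence: \n", sample)
print("Size of the Embedding Vector: ", emb_len)
print(f"First 10 Elements of the Embedding Vector (Total Elements: {emb_len}): \n", embeddings[0][0][0:10])


# Cosine similarity, scaled by 100
def cal_score(a, b):
    # Promote 1-D vectors to 2-D (1, hidden_size) so torch.mm can be used.
    if len(a.shape) == 1: a = a.unsqueeze(0)
    if len(b.shape) == 1: b = b.unsqueeze(0)

    # Divide each row by its L2 norm to get unit vectors.
    a_norm = a / a.norm(dim=1)[:, None]
    b_norm = b / b.norm(dim=1)[:, None]
    return torch.mm(a_norm, b_norm.transpose(0, 1)) * 100


def show_embedding_score(tokenizer, model, sentences):
    # Embed all sentences in one padded batch.
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    embeddings, _ = model(**inputs, return_dict=False)

    # Compare the first sentence with the second and with the third.
    score01 = cal_score(embeddings[0][0], embeddings[1][0])
    score02 = cal_score(embeddings[0][0], embeddings[2][0])

    print(score01, score02)


# The second sentence is identical to the sample; the third is the opening
# words of the Korean national anthem, used here as an unrelated sentence.
sentences1 = [sample, "나는 김현민", '동해물과 백두산이']

show_embedding_score(tokenizer, model, sentences1)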

Sample Sentence:
 나는 김현민
Size of the Embedding Vector:  768
First 10 Elements of the Embedding Vector (Total Elements: 768):
 tensor([ 0.0241, -1.0539,  0.0650, -1.4253,  0.2898, -0.3100, -0.5611,  0.2530,
         0.0636, -0.2037], grad_fn=<SliceBackward0>)
tensor([[100.0000]], grad_fn=<MulBackward0>) tensor([[11.6377]], grad_fn=<MulBackward0>)
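
To make the two scores above easier to read, here is a minimal sketch of the same calculation that cal_score performs (normalize both vectors, take the dot product, multiply by 100), written in plain Python so it runs without torch. The function name cosine_score and the 3-dimensional toy vectors are assumptions made for illustration and are not model output. It suggests why an identical sentence scores about 100, while vectors pointing in unrelated directions score close to 0.

```python
import math


def cosine_score(a, b):
    """Cosine similarity of two equal-length vectors, scaled by 100."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) * 100


# Toy 3-dimensional vectors (hypothetical values, not model output).
v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]   # same direction as v1, twice the length
v3 = [3.0, 0.0, -1.0]  # orthogonal to v1: 1*3 + 2*0 + 3*(-1) = 0

same = cosine_score(v1, v1)
parallel = cosine_score(v1, v2)
orthogonal = cosine_score(v1, v3)
print(same, parallel, orthogonal)
```

Because both vectors are normalized first, the score depends only on direction and not on vector length, which is why v1 and v2 score the same as v1 against itself.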

Quick SageMaker setup

We generate vector embeddings with the KoSimCSE-RoBERTa model.
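
Once sentences have been embedded, a retrieval step typically compares a query embedding against stored embeddings and sorts them by similarity. The sketch below is not from the original post: normalize, rank_by_similarity, and the toy vectors are hypothetical names and values standing in for real 768-dimensional embeddings. It assumes the common approach of normalizing each vector to unit length so that a plain dot product equals cosine similarity.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def rank_by_similarity(query_vec, doc_vecs):
    """Return (index, score) pairs for doc_vecs, most similar to query_vec first."""
    q = normalize(query_vec)
    scored = []
    for i, doc in enumerate(doc_vecs):
        d = normalize(doc)
        scored.append((i, sum(x * y for x, y in zip(q, d))))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for sentence embeddings (hypothetical values).
docs = [[0.0, 1.0, 0.0], [1.0, 1.0, 0.0], [1.0, 0.0, 0.1]]
query = [1.0, 0.0, 0.0]

ranking = rank_by_similarity(query, docs)
print([i for i, _ in ranking])
```

In a real pipeline the stored vectors would usually be normalized once when they are indexed rather than on every query; the per-call normalization here only keeps the sketch short.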