Building a RAG Chatbot with SageMaker, Streamlit, and OpenSearch: 5. FAQ with OpenSearch

FAQ with OpenSearch - Vector Store Test

This time, instead of a local FAISS index, we will use Amazon OpenSearch Service as the vector DB.

ragostest.ipynb

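Creating the OpenSearch cluster

We will create the OpenSearch cluster that will serve as the vector DB.

Go to Amazon OpenSearch Service and create a domain.

Engine options: OpenSearch_2.9

Network: Public access

Once the domain is created, go to the Security configuration tab and edit the access policy.

Effect: Deny → Allow

Record the dashboard URL and the domain endpoint.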
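
For the access policy edit described above, the resulting policy document might look roughly like the sketch below. The account ID and domain name are placeholders (assumptions, not values from this walkthrough), and a wildcard principal is only sensible when the domain is also protected by a master user name and password.

```python
import json

# Placeholders: replace with your own values (hypothetical, for illustration only).
REGION = "ap-northeast-2"
ACCOUNT_ID = "123456789012"
DOMAIN_NAME = "my-vectordb"

access_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",  # changed from "Deny"
            "Principal": {"AWS": "*"},
            "Action": "es:*",
            "Resource": f"arn:aws:es:{REGION}:{ACCOUNT_ID}:domain/{DOMAIN_NAME}/*",
        }
    ],
}

# Print the JSON that would be pasted into the access policy editor.
print(json.dumps(access_policy, indent=2))
```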

KULLM endpoint handler

%store -r endpoint_name_emb endpoint_name_text
try:
    endpoint_name_emb
    endpoint_name_text
except NameError:
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] Please re-run the TASK-1 and TASK-2 notebooks.")
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++")

import sys
%load_ext autoreload
%autoreload 2
sys.path.append('./utils')  # add the utils folder to the import path
import json
import time
import boto3
import botocore
import numpy as np
from inference_utils import Prompter
from typing import Any, Dict, List, Optional
from langchain.embeddings import SagemakerEndpointEmbeddings
from langchain.llms.sagemaker_endpoint import LLMContentHandler, SagemakerEndpoint
from langchain.embeddings.sagemaker_endpoint import EmbeddingsContentHandler
prompter = Prompter("kullm")
params = {
      'do_sample': False,
      'max_new_tokens': 128,
      'temperature': 1.0,
      'top_k': 0,
      'top_p': 0.9,
      'return_full_text': False,
      'repetition_penalty': 1.1,
      'presence_penalty': None,
      'eos_token_id': 2
}

class KullmContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: Dict) -> bytes:
        '''
        Preprocess the input and return the request payload as bytes.
        '''
        # The chain passes "context||SPEPERATOR||question"; split on the first separator only.
        context, question = prompt.split("||SPEPERATOR||", 1)
        prompt = prompter.generate_prompt(question, context)

        print("prompt", prompt)
        payload = {
            'inputs': [prompt],
            'parameters': model_kwargs
        }

        input_str = json.dumps(payload)

        return input_str.encode('utf-8')

    def transform_output(self, output: bytes) -> str:
        # `output` is the streaming response body, so read() it before decoding.
        response_json = json.loads(output.read().decode("utf-8"))
        generated_text = response_json[0][0]["generated_text"]

        return generated_text

aws_region = boto3.Session().region_name
LLMTextContentHandler = KullmContentHandler()
# Sentinel that joins context and question in the prompt template (spelling kept as in the handler).
seperator = "||SPEPERATOR||"

llm_text = SagemakerEndpoint(
    endpoint_name=endpoint_name_text,
    region_name=aws_region,
    model_kwargs=params,    
    content_handler=LLMTextContentHandler,
)
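
To make the request and response shapes used by the content handler concrete, here is a minimal offline sketch. `encode_request` and `decode_response` are hypothetical helper names, and the fake reply simply assumes the `[[{"generated_text": ...}]]` nesting that the handler indexes with `[0][0]`; no endpoint is called.

```python
import io
import json

def encode_request(prompt, parameters):
    """Build the JSON body the handler sends: {"inputs": [...], "parameters": {...}}."""
    return json.dumps({"inputs": [prompt], "parameters": parameters}).encode("utf-8")

def decode_response(body):
    """Parse a reply shaped like [[{"generated_text": ...}]] from a file-like body."""
    return json.loads(body.read().decode("utf-8"))[0][0]["generated_text"]

request = encode_request("Hello", {"max_new_tokens": 128, "do_sample": False})
print(request)

# Fake endpoint reply (assumption: same nesting the handler expects).
fake_body = io.BytesIO(json.dumps([[{"generated_text": "Hi there"}]]).encode("utf-8"))
print(decode_response(fake_body))
```

The reply body is wrapped in `io.BytesIO` because the handler calls `read()` on it, as it would on a streaming response.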

Embedding model endpoint handler

class SagemakerEndpointEmbeddingsJumpStart(SagemakerEndpointEmbeddings):
    def embed_documents(self, texts: List[str], chunk_size: int=1) -> List[List[float]]:
        """Compute doc embeddings using a SageMaker Inference Endpoint.

        Args:
            texts: The list of texts to embed.
            chunk_size: The chunk size defines how many input texts will
                be grouped together as request. If None, will use the
                chunk size specified by the class.

        Returns:
            List of embeddings, one for each text.
        """
        results = []
        _chunk_size = len(texts) if chunk_size > len(texts) else chunk_size
        
        print("text size: ", len(texts))
        print("_chunk_size: ", _chunk_size)

        for i in range(0, len(texts), _chunk_size):
            
            #print (i, texts[i : i + _chunk_size])
            response = self._embedding_func(texts[i : i + _chunk_size])
            #print (i, response, len(response[0].shape))
            
            results.extend(response)
        return results
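
The batching loop in `embed_documents` can be tried without an endpoint. The sketch below uses a stand-in embedding function (`fake_embed`, hypothetical) to show how `chunk_size` controls the number of requests. With the default `chunk_size` of 1, every text becomes its own request, which matches the `_chunk_size: 1` line logged during indexing in this post.

```python
from typing import Callable, List

def embed_in_batches(texts: List[str],
                     embed_batch: Callable[[List[str]], List[List[float]]],
                     chunk_size: int = 1) -> List[List[float]]:
    """Call embed_batch on slices of at most chunk_size texts and concatenate the results."""
    if not texts:
        return []
    size = min(max(chunk_size, 1), len(texts))
    results: List[List[float]] = []
    for i in range(0, len(texts), size):
        results.extend(embed_batch(texts[i:i + size]))
    return results

calls = []

def fake_embed(batch):
    # Stand-in for the endpoint call: records the batch and returns 1-dimensional "embeddings".
    calls.append(list(batch))
    return [[float(len(text))] for text in batch]

vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed, chunk_size=2)
print(len(calls), vectors)
```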
class KoSimCSERobertaContentHandler(EmbeddingsContentHandler):
    
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs={}) -> bytes:
        
        input_str = json.dumps({"inputs": prompt, **model_kwargs})
        
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> Optional[List[List[float]]]:
        
        response_json = json.loads(output.read().decode("utf-8"))
        ndim = np.array(response_json).ndim    
        
        if ndim == 4:
            # Original shape (1, 1, n, 768)
            emb = response_json[0][0][0]
            emb = np.expand_dims(emb, axis=0).tolist()
        elif ndim == 2:
            # Original shape (n, 1)
            emb = []
            for ele in response_json:
                e = ele[0][0]
                emb.append(e)
        else:
            print(f"Other # of dimension: {ndim}")
            emb = None
        return emb
LLMEmbHandler = KoSimCSERobertaContentHandler()

llm_emb = SagemakerEndpointEmbeddingsJumpStart(
    endpoint_name=endpoint_name_emb,
    region_name=aws_region,
    content_handler=LLMEmbHandler,
)

Loading the data

Load the FAQ data with CSVLoader.

import json
import boto3
from langchain.document_loaders.csv_loader import CSVLoader
loader = CSVLoader(
    file_path="./dataset/fsi_smart_faq_ko.csv",
    source_column="Source",
    encoding="utf-8"
)
context_documents = loader.load()

len(context_documents), context_documents[5]

(89, Document(page_content='no: 84\nCategory: 기존 공동인증서를 보유한 상태에서 금융인증서 발급이 가능한가요?\nInformation: 공동인증서와 금융인증서는 별개의 인증서로 두 가지 인증서를 모두 사용할 수 있습니다.\ntype: 금융인증서\nSource: 신한은행', metadata={'source': '신한은행', 'row': 5}))

Indexing the data into OpenSearch

import time
import pprint
import logging
import sagemaker
from langchain.vectorstores import OpenSearchVectorSearch
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
# global constants
logger = logging.getLogger()
logging.basicConfig(format='%(asctime)s,%(module)s,%(processName)s,%(levelname)s,%(message)s', level=logging.INFO, stream=sys.stderr)

role = sagemaker.get_execution_role()
role

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml

'arn:aws:iam::759320821027:role/service-role/AmazonSageMaker-ExecutionRole-20231030T104069'

Enter the index name, domain endpoint, and login credentials to use.

index_name = "fsi-sample"
opensearch_domain_endpoint = "https://search-hmkim-vectordb-z37b25etffjy4udj5xh7cnhsse.ap-northeast-2.es.amazonaws.com"
http_auth = ("raguser", "Smileshark12!@")
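
Hard-coding the login pair is convenient in a workshop notebook, but the same tuple can be built from environment variables instead. The sketch below is one possible approach. The variable names `OPENSEARCH_USER` and `OPENSEARCH_PASSWORD` and the helper `load_opensearch_auth` are assumptions made for illustration, not part of the original notebook.

```python
import os

def load_opensearch_auth(env=os.environ):
    """Return (user, password) from environment variables, failing loudly if one is unset."""
    try:
        return (env["OPENSEARCH_USER"], env["OPENSEARCH_PASSWORD"])
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None

# Example (after exporting both variables in the notebook environment):
# http_auth = load_opensearch_auth()
```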

Install opensearch-py, the OpenSearch client library for Python.

!pip install opensearch-py

Installing collected packages: opensearch-py
Successfully installed opensearch-py-2.3.2

Send the vector store data to OpenSearch using the Bulk API.

%%time
logger.info('Loading documents ...')
docs = loader.load()

# add custom metadata fields, such as a timestamp
for doc in docs:
    doc.metadata['timestamp'] = time.time()
    doc.metadata['embeddings_model'] = endpoint_name_emb

text_splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=200)
documents = text_splitter.split_documents(docs)

# by default langchain would create a k-NN index and the embeddings would be ingested as a k-NN vector type
docsearch = OpenSearchVectorSearch.from_documents(
    index_name=index_name,
    documents=documents,
    embedding=llm_emb,
    opensearch_url=opensearch_domain_endpoint,
    http_auth=http_auth,
    bulk_size=10000,
    timeout=60
)

text size: 90
_chunk_size: 1
CPU times: user 5.85 s, sys: 273 ms, total: 6.13 s
Wall time: 18.8 s

Checking the fsi-sample index in OpenSearch Dashboards confirms that 90 documents were stored as vector embeddings.
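
The loader returned 89 rows while 90 texts were embedded, which suggests that one row was longer than `chunk_size` and was split in two (an inference from the logged counts, not something the logs state). The sketch below illustrates the idea with a fixed-width sliding window. It is a simplification, not the separator-aware algorithm that `RecursiveCharacterTextSplitter` actually uses, and `split_with_overlap` is a hypothetical helper.

```python
def split_with_overlap(text, chunk_size=800, chunk_overlap=200):
    """Fixed-width sliding window; a simplified stand-in for recursive splitting."""
    if len(text) <= chunk_size:
        return [text]
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

short_row = "x" * 300   # fits in a single chunk
long_row = "y" * 1000   # longer than chunk_size, so it is split
print(len(split_with_overlap(short_row)), len(split_with_overlap(long_row)))
```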

🦜🔗 Chaining with LangChain QnA

from functools import lru_cache
from langchain import PromptTemplate
from langchain.chains.question_answering import load_qa_chain
import copy
import functools
import concurrent.futures

prompt_template = ''.join(["{context}", seperator, "{question}"])
PROMPT = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
chain = load_qa_chain(llm=llm_text, chain_type="stuff", prompt=PROMPT, verbose=True)

vectro_db = OpenSearchVectorSearch(
    index_name=index_name,
    opensearch_url=opensearch_domain_endpoint,
    embedding_function=llm_emb,
    http_auth=http_auth,
    is_aoss=False,
    engine="faiss",
    space_type="l2"
)
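
The prompt template and the KULLM content handler cooperate through the separator string: the template joins context and question, and the handler splits them apart again. The following offline sketch mimics that round trip with two hypothetical helpers, including the empty-context case that appears in the greeting test as `||SPEPERATOR||` at the very start of the formatted prompt.

```python
SEPARATOR = "||SPEPERATOR||"  # same sentinel string as in the post

def build_chain_prompt(context, question):
    """Mimic the PromptTemplate: "{context}" + separator + "{question}"."""
    return "".join([context, SEPARATOR, question])

def split_chain_prompt(prompt):
    """Mimic the content handler: recover (context, question)."""
    context, question = prompt.split(SEPARATOR, 1)
    return context, question

with_context = build_chain_prompt("Category: ...\nInformation: ...", "How do I apply?")
no_context = build_chain_prompt("", "Hello!")

print(split_chain_prompt(with_context))
print(split_chain_prompt(no_context))  # empty context, as in the greeting test
```

Passing both values through a single string is a workaround for the LLM wrapper accepting only one prompt string; it relies on the sentinel never occurring in the FAQ text or the question.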

def pretty_print_documents(response):
    for doc, score in response:
        print(f'\nScore: {score}')
        print(f'Document Number: {doc.metadata["row"]}')
        print(f'Source: {doc.metadata["source"]}')

        # Split the page content into lines
        lines = doc.page_content.split("\n")

        # Print each "key: value" pair; split once so values containing ": " stay intact
        for line in lines:
            split_line = line.split(": ", 1)
            if len(split_line) > 1:
                print(f'{split_line[0]}: {split_line[1]}')

        print('-' * 50)

def filter_and_remove_score_opensearch_vector_score(res, cutoff_score=0.006, variance=0.95):
    # Nothing was retrieved, so there is nothing to filter
    if not res:
        return []
    # Get the highest (best) score
    highest_score = max(score for doc, score in res)
    print('highest_score : ', highest_score)
    # If even the best score is below the cutoff, return an empty list
    if highest_score < cutoff_score:
        return []
    # Calculate the lower bound for scores
    lower_bound = highest_score * variance
    print('lower_bound : ', lower_bound)
    # Keep documents scoring at or above the lower bound, and drop the scores
    res = [doc for doc, score in res if score >= lower_bound]

    return res
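
The effect of `cutoff_score` and `variance` is easiest to see on real numbers. The sketch below re-implements the same rule on plain `(label, score)` tuples (`filter_by_score` is a hypothetical stand-in) and feeds it the scores logged by the two test runs in this post; the labels are the FAQ `no` values.

```python
def filter_by_score(results, cutoff_score=0.006, variance=0.95):
    """Same rule as the filter function, applied to plain (label, score) tuples."""
    if not results:
        return []
    highest = max(score for _, score in results)
    if highest < cutoff_score:
        return []
    lower_bound = highest * variance
    return [doc for doc, score in results if score >= lower_bound]

# Scores copied from the two test runs logged in this post.
greeting_hits = [("no 70", 0.0033899078), ("no 7", 0.0032900404),
                 ("no 40", 0.0032718105), ("no 8", 0.0032451365)]
transfer_hits = [("no 3", 0.007392953), ("no 22", 0.006629911),
                 ("no 36", 0.006477051), ("no 32", 0.006469134),
                 ("no 31", 0.0060997857)]

print(filter_by_score(greeting_hits))  # → [] because the best score is under the cutoff
print(filter_by_score(transfer_hits))  # → ['no 3'] because only hits within 5% of the best survive
```

This is why the greeting reaches the model with an empty context, while the automatic transfer question is answered from a single FAQ entry.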

def get_similiar_docs(query, k=5, fetch_k=300, score=True, bank=""):

    #query = f'{bank}, {query}'
    print(query)

    if score:
        pre_similar_doc = vectro_db.similarity_search_with_score(
            query,
            k=k,
            fetch_k=fetch_k,
            search_type="approximate_search",  # approximate_search, script_scoring, painless_scripting
            space_type="l2",  # "l2", "l1", "linf", "cosinesimil", "innerproduct", "hammingbit"
            pre_filter={"bool": {"filter": {"term": {"text": bank}}}},
            boolean_filter={"bool": {"filter": {"term": {"text": bank}}}}
            #filter=dict(source=bank)
        )
        pretty_print_documents(pre_similar_doc)
        similar_docs = filter_and_remove_score_opensearch_vector_score(pre_similar_doc)
    else:
        similar_docs = vectro_db.similarity_search(
            query,
            k=k,
            search_type="approximate_search",  # approximate_search, script_scoring, painless_scripting
            space_type="l2",  # "l2", "l1", "linf", "cosinesimil", "innerproduct", "hammingbit"
            pre_filter={"bool": {"filter": {"term": {"text": bank}}}},
            boolean_filter={"bool": {"filter": {"term": {"text": bank}}}}
        )
    similar_docs_copy = copy.deepcopy(similar_docs)

    return similar_docs_copy

def get_answer(query, bank="", score=False, fetch_k=300, k=1):

    search_query = query

    similar_docs = get_similiar_docs(search_query, k=k, fetch_k=fetch_k, score=score, bank=bank)

    # Prompt (Korean): "Like a customer service agent, find and explain the Information for the category <query>."
    llm_query = '고객 서비스 센터 직원처럼, '+query+' 카테고리에 대한 Information을 찾아서 설명해주세요.'

    # With no relevant documents, pass the raw query so the model answers without FAQ context
    if not similar_docs:
        llm_query = query

    answer = chain.run(input_documents=similar_docs, question=llm_query)

    return answer

Running the tests

# "Hello. The weather is really nice today."
question = '안녕하세요. 날씨가 참 좋네요.'
response = get_answer(question, bank='신한은행', score=True, k=4)
print("챗봇 : ", response)

Score: 0.0033899078 Document Number: 19 Source: 신한은행 no: 70 Category: 홈페이지상에 제가 등록한 칭찬/불만/제안사항 조회할 수 있나요? Information: 로그인 후 등록한 접수내용에 대해서 확인 가능합니다. type: 홈페이지 Source: 신한은행

Score: 0.0032900404 Document Number: 82 Source: 신한은행 no: 7 Category: 인터넷으로 신규 예/적금 신청하는 방법을 알려주세요 Information: 인터넷상으로 예금/신탁을 신규가입하시려면 우선 고객님께서는인터넷뱅킹에 가입하셔야 하며 신규방법은 두 가지가 있습니다.1. 인터넷뱅킹에서 가입인터넷뱅킹 로그인을 하신 후 예금/신탁 > 신규 메뉴에서 예금 및 신탁 상품을 신규하실 수 있습니다.2. 신한S뱅크에서 가입신한S뱅크 상품센터 > 예금센터 메뉴에서 예금상품을 신규하실 수 있습니다. type: 인터넷뱅킹 Source: 신한은행

Score: 0.0032718105 Document Number: 49 Source: 신한은행 no: 40 Category: 회원탈퇴 후 메일이 계속와요. Information: 인터넷뱅킹가입을 하시면 예금/대출/카드 등 거래에 대한 안내(예:예금만기 등)외에 영업점안내메일 등 몇가지 부가서비스가 기본제공됩니다. 부가서비스는 홈페이지에 로그인하셔서(인터넷뱅킹사용자는 별도 회원가입이 필요없습니다.) 이메일서비스의 수신/거부 등 변경을 하시면 됩니다. 다만, 인터넷뱅킹을 해지하시더라도 예금,카드,대출 등의 거래가 남아있을 수 있기에 메일서비스는 계속 제공됩니다. 따라서 고객님의 경우에는 기존에 제공되는 메일서비스가 계속 남아있어 부가서비스 메일을 받으신 것이며. 정보가 유출되거나 하는 경우는 절대 없으니 안심하시기 바랍니다. 더이상 부가서비스 이메일 수신을 원하지 않는 경우에는 홈페이지의 회원가입을 하신 후 회원서비스의 이메일서비스에 가셔서 변경하시면 됩니다. type: Source: 신한은행

Score: 0.0032451365 Document Number: 81 Source: 신한은행 no: 8 Category: 보안메일서비스 안내해줘 Information: 신한은행은 이메일서비스를 통해 고객님의 거래정보와 금융정보를 메일로 알려드리는데 고객님께 발송되는 이메일 중 개인정보보호가 필요한 메일(거래정보 등)은 암호화 처리되어 보안메일로 발송되어 일반메일 (홍보 및 안내메일) 과 구별됩니다. ※ 입출내역 통지서비스는 개인뱅킹 > 뱅킹보안서비스 > 통지서비스 > 입출내역 Email통지서비스 메뉴에서 서비스 신청 및 변경이 가능합니다. type: 인터넷뱅킹 Source: 신한은행

highest_score : 0.0033899078

Entering new StuffDocumentsChain chain...> Entering new LLMChain chain... Prompt after formatting: ||SPEPERATOR||안녕하세요. 날씨가 참 좋네요. prompt 아래는 작업을 설명하는 명령어입니다. 요청을 적절히 완료하는 응답을 작성하세요.

명령어:

안녕하세요. 날씨가 참 좋네요.

응답:

Finished chain.> Finished chain. 챗봇 : 안녕하세요! 오늘은 무엇을 도와드릴까요?
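
The scores in these logs are small because of how OpenSearch scores approximate k-NN hits in the `l2` space. Assuming the conversion documented for the k-NN plugin, score = 1 / (1 + d) with d the squared L2 distance, a score can be turned back into a distance as sketched below (`l2_score_to_squared_distance` is a hypothetical helper). Under that assumption the best greeting hit sits at a squared distance of roughly 294, the best automatic transfer hit at roughly 134, and the cutoff of 0.006 corresponds to roughly 166.

```python
def l2_score_to_squared_distance(score):
    """Invert score = 1 / (1 + d), where d is the squared L2 distance."""
    if not 0.0 < score <= 1.0:
        raise ValueError("score must be in (0, 1]")
    return 1.0 / score - 1.0

for label, score in [("greeting, best hit", 0.0033899078),
                     ("automatic transfer, best hit", 0.007392953),
                     ("cutoff_score", 0.006)]:
    print(f"{label}: score={score} -> d={l2_score_to_squared_distance(score):.1f}")
```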

# "How do I sign up for the automatic transfer service?"
q = '자동이체 서비스는 어떻게 신청해야 하나요?'
response = get_answer(q, bank='신한은행', score=True, k=5)

print("챗봇 : ", response)

자동이체 서비스는 어떻게 신청해야 하나요?

Score: 0.007392953 Document Number: 86 Source: 신한은행 no: 3 Category: 공과금 자동이체 신청이 가능한가요? Information: 인터넷뱅킹에 로그인하신 후 "공과금/법원 > 공과금센터" 페이지에 가시면 "지로자동이체 등록" 메뉴가 있습니다. 해당 메뉴를 통하여 "전기요금, 전화요금, 국민연금, 국민건강보험료 등을 포함하여 각종 지로요금"을 모두 자동이체 등록하실 수 있습니다. type: 인터넷뱅킹 Source: 신한은행

Score: 0.006629911 Document Number: 67 Source: 신한은행 no: 22 Category: 인터넷 예적금 해약하려면 어떻게 해야 하나요? Information: 인터넷에서 신규하셨고, 이후 통장발급을 받지 않으셨다면 인터넷뱅킹(http://bank.shinhan.com)의 금융상품 예금/신탁 해지 메뉴를 통해 해지하실 수 있습니다. type: Source: 신한은행

Score: 0.006477051 Document Number: 53 Source: 신한은행 no: 36 Category: 이메일서비스 신청 및 해지 방법은? Information: 이메일서비스는 인터넷뱅킹의 서비스메뉴 중 뱅킹보안서비스 > 입출내역통지서비스 > 입출내역 E-Mail통지서비스에서 신청 및 해지하실 수 있습니다. type: 인터넷뱅킹 Source: 신한은행

Score: 0.006469134 Document Number: 57 Source: 신한은행 no: 32 Category: 휴대폰통지서비스 신청 방법은? Information: 휴대폰 통지서비스는 본인의 금융거래내역을 거래발생 즉시 등록된 이동통신 단말기로 통지해 주는 서비스입니다. 휴대폰통지서비스 신청방법 ① 개인뱅킹 로그인 ② 뱅킹보안서비스 ③ 입출내역통지서비스 ④ S알리미 서비스와 입출내역 SMS통지서비스 중 선택 휴대폰통지서비스 특징 S알리미 서비스 type: Source: 신한은행

Score: 0.0060997857 Document Number: 58 Source: 신한은행 no: 31 Category: 보안카드 번호 입력은 어떻게 하면 되죠? Information: 씨크리트(보안)카드는 고객님이 은행에서 신한온라인서비스(인터넷뱅킹)에 가입하신 후 받은 카드입니다. 보안카드번호가 필요한 모든 거래에 기타 궁금하신 내용은 신한은행 고객센터 1599-8000로 문의하여 주시기 바랍니다. type: 인터넷뱅킹 Source: 신한은행

highest_score : 0.007392953 lower_bound : 0.007023305349999999

Entering new StuffDocumentsChain chain...> Entering new LLMChain chain... Prompt after formatting: no: 3 Category: 공과금 자동이체 신청이 가능한가요? Information: 인터넷뱅킹에 로그인하신 후 "공과금/법원 > 공과금센터" 페이지에 가시면 "지로자동이체 등록" 메뉴가 있습니다. 해당 메뉴를 통하여 "전기요금, 전화요금, 국민연금, 국민건강보험료 등을 포함하여 각종 지로요금"을 모두 자동이체 등록하실 수 있습니다. type: 인터넷뱅킹 Source: 신한은행||SPEPERATOR||고객 서비스 센터 직원처럼, 자동이체 서비스는 어떻게 신청해야 하나요? 카테고리에 대한 Information을 찾아서 설명해주세요. prompt 아래는 작업을 설명하는 명령어와 추가 컨텍스트를 제공하는 입력이 짝을 이루는 예제입니다. 요청을 적절히 완료하는 응답을 작성하세요.

명령어:

고객 서비스 센터 직원처럼, 자동이체 서비스는 어떻게 신청해야 하나요? 카테고리에 대한 Information을 찾아서 설명해주세요.

입력:

no: 3 Category: 공과금 자동이체 신청이 가능한가요? Information: 인터넷뱅킹에 로그인하신 후 "공과금/법원 > 공과금센터" 페이지에 가시면 "지로자동이체 등록" 메뉴가 있습니다. 해당 메뉴를 통하여 "전기요금, 전화요금, 국민연금, 국민건강보험료 등을 포함하여 각종 지로요금"을 모두 자동이체 등록하실 수 있습니다. type: 인터넷뱅킹 Source: 신한은행

응답:
Finished chain.> Finished chain. 챗봇 : 인터넷뱅킹에 로그인하고 "지로자동이체 등록" 메뉴를 찾은 다음, "전기요금, 전화요금, 국민연금, 국민건강보험료 등을 포함한 각종 지로 요금"을 자동이체로 등록할 수 있습니다. 이 정보는 "은행" 카테고리에서 찾을 수 있으며, "인터넷뱅킹" 카테고리에서도 찾을 수 있습니다.

FAQ with OpenSearch - Vector Store Test

이번에는 Local FAISS 말고 Amazon OpenSearch 서비스를 Vector DB로 사용해보겠습니다.

ragostest.ipynb

OpenSearch 클러스터 생성

DB로 활용될 OpenSearch 클러스터를 생성하도록 하겠습니다.

🔗Amazon OpenSearch Service 로 이동하여 도메인을 생성합니다.

Engine options: OpenSearch_2.9

Network: Public access

도메인이 생성되면, 보안구성 탭으로 이동하여 액세스 정책을 수정합니다.

Effect : Deny → Allow

대시보드 URL과 도메인 엔드포인트를 기록해둡니다.

KULLM 엔드포인트 핸들러 작성
%store -r endpoint_name_emb endpoint_name_text
try:
    endpoint_name_emb
    endpoint_name_text
except NameError:
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] TASK-1, TASK-2 노트북을 다시 실행해 주세요.")
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
import sys
%load_ext autoreload
%autoreload 2
sys.path.append('./utils') # src 폴더 경로 설정
import json
import time
import boto3
import botocore
import numpy as np
from inference_utils import Prompter
from typing import Any, Dict, List, Optional
from langchain.embeddings import SagemakerEndpointEmbeddings
from langchain.llms.sagemaker_endpoint import LLMContentHandler, SagemakerEndpoint
from langchain.embeddings.sagemaker_endpoint import EmbeddingsContentHandler
prompter = Prompter("kullm")
params = {
      'do_sample': False,
      'max_new_tokens': 128,
      'temperature': 1.0,
      'top_k': 0,
      'top_p': 0.9,
      'return_full_text': False,
      'repetition_penalty': 1.1,
      'presence_penalty': None,
      'eos_token_id': 2
}

class KullmContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs={}) -> bytes:
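        '''
        Split the combined prompt into context and question, build the KULLM prompt,
        and return the JSON request body as bytes
        '''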
        context, question = prompt.split("||SPEPERATOR||") 
        prompt = prompter.generate_prompt(question, context)

        print ("prompt", prompt)
        payload = {
            'inputs': [prompt],
            'parameters': model_kwargs
        }
                           
        input_str = json.dumps(payload)
        
        return input_str.encode('utf-8')
    

    def transform_output(self, output: bytes) -> str:
        
        response_json = json.loads(output.read().decode("utf-8"))              
        generated_text = response_json[0][0]["generated_text"]
        
        return generated_text
aws_region = boto3.Session().region_name
LLMTextContentHandler = KullmContentHandler()
seperator = "||SPEPERATOR||"

llm_text = SagemakerEndpoint(
    endpoint_name=endpoint_name_text,
    region_name=aws_region,
    model_kwargs=params,    
    content_handler=LLMTextContentHandler,
)
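
The handler relies on a simple convention: the retrieved context and the question travel as one string joined by the `||SPEPERATOR||` literal, and `transform_input` splits them apart again before building the model prompt. A minimal, self-contained sketch of that round trip follows. `join_prompt` and `split_prompt` are hypothetical helper names, not part of the notebook or of LangChain.

```python
from typing import Tuple

SEPARATOR = "||SPEPERATOR||"  # same literal, including its spelling, as in the notebook


def join_prompt(context: str, question: str) -> str:
    """Combine context and question into the single string the chain sends."""
    return f"{context}{SEPARATOR}{question}"


def split_prompt(prompt: str) -> Tuple[str, str]:
    """Recover (context, question); maxsplit=1 tolerates the literal inside the question."""
    context, question = prompt.split(SEPARATOR, 1)
    return context, question


context, question = split_prompt(join_prompt("Information: ...", "How do I apply?"))
# With no retrieved documents the context is simply empty
empty_context, greeting = split_prompt(join_prompt("", "Hello"))
print(repr(context), repr(question))
print(repr(empty_context), repr(greeting))
```

An empty context is the case where retrieval returned nothing and the question is passed to the model on its own.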
Writing the embedding model endpoint handler
class SagemakerEndpointEmbeddingsJumpStart(SagemakerEndpointEmbeddings):
    def embed_documents(self, texts: List[str], chunk_size: int=1) -> List[List[float]]:
        """Compute doc embeddings using a SageMaker Inference Endpoint.

        Args:
            texts: The list of texts to embed.
            chunk_size: The chunk size defines how many input texts will
                be grouped together as request. If None, will use the
                chunk size specified by the class.

        Returns:
            List of embeddings, one for each text.
        """
        results = []
        _chunk_size = len(texts) if chunk_size > len(texts) else chunk_size
        
        print("text size: ", len(texts))
        print("_chunk_size: ", _chunk_size)

        for i in range(0, len(texts), _chunk_size):
            
            #print (i, texts[i : i + _chunk_size])
            response = self._embedding_func(texts[i : i + _chunk_size])
            #print (i, response, len(response[0].shape))
            
            results.extend(response)
        return results
class KoSimCSERobertaContentHandler(EmbeddingsContentHandler):
    
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs={}) -> bytes:
        
        input_str = json.dumps({"inputs": prompt, **model_kwargs})
        
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> Optional[List[List[float]]]:
        
        response_json = json.loads(output.read().decode("utf-8"))
        ndim = np.array(response_json).ndim    
        
        if ndim == 4:
            # Original shape (1, 1, n, 768)
            emb = response_json[0][0][0]
            emb = np.expand_dims(emb, axis=0).tolist()
        elif ndim == 2:
            # Original shape (n, 1)
            emb = []
            for ele in response_json:
                e = ele[0][0]
                emb.append(e)
        else:
            print(f"Other # of dimension: {ndim}")
            emb = None
        return emb
LLMEmbHandler = KoSimCSERobertaContentHandler()

llm_emb = SagemakerEndpointEmbeddingsJumpStart(
    endpoint_name=endpoint_name_emb,
    region_name=aws_region,
    content_handler=LLMEmbHandler,
)
Loading the data

Load the FAQ data with CSVLoader.
import json
import boto3
from langchain.document_loaders.csv_loader import CSVLoader
loader = CSVLoader(
    file_path="./dataset/fsi_smart_faq_ko.csv",
    source_column="Source",
    encoding="utf-8"
)
context_documents = loader.load()
len(context_documents), context_documents[5]
(89, Document(page_content='no: 84\nCategory: 기존 공동인증서를 보유한 상태에서 금융인증서 발급이 가능한가요?\nInformation: 공동인증서와 금융인증서는 별개의 인증서로 두 가지 인증서를 모두 사용할 수 있습니다.\ntype: 금융인증서\nSource: 신한은행', metadata={'source': '신한은행', 'row': 5}))
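
The output above shows how each CSV row becomes one document: the columns are rendered as `key: value` lines, and the `Source` column is copied into the metadata together with the row index. The sketch below imitates that behaviour using only the standard `csv` module. `rows_to_documents` is a hypothetical helper, the column order is inferred from the output above, and the real `CSVLoader` returns `Document` objects rather than plain dicts.

```python
import csv
import io
from typing import Dict, List


def rows_to_documents(csv_text: str, source_column: str) -> List[Dict]:
    """Turn each CSV row into a dict with 'page_content' and 'metadata'."""
    documents = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_index, row in enumerate(reader):
        # One "column: value" line per column, in header order
        page_content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append({
            "page_content": page_content,
            "metadata": {"source": row[source_column], "row": row_index},
        })
    return documents


# Single sample row copied from the output above
sample_csv = (
    "no,Category,Information,type,Source\n"
    "84,기존 공동인증서를 보유한 상태에서 금융인증서 발급이 가능한가요?,"
    "공동인증서와 금융인증서는 별개의 인증서로 두 가지 인증서를 모두 사용할 수 있습니다.,"
    "금융인증서,신한은행\n"
)
documents = rows_to_documents(sample_csv, source_column="Source")
print(documents[0]["page_content"])
print(documents[0]["metadata"])
```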

Indexing the data into OpenSearch
import time
import pprint
import logging
import sagemaker
from langchain.vectorstores import OpenSearchVectorSearch
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
# global constants
logger = logging.getLogger()
logging.basicConfig(format='%(asctime)s,%(module)s,%(processName)s,%(levelname)s,%(message)s', level=logging.INFO, stream=sys.stderr)

role = sagemaker.get_execution_role()
role
sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml

'arn:aws:iam::759320821027:role/service-role/AmazonSageMaker-ExecutionRole-20231030T104069'

Enter the index name, the domain endpoint, and the login credentials to use.
index_name = "fsi-sample"
opensearch_domain_endpoint = "https://search-hmkim-vectordb-z37b25etffjy4udj5xh7cnhsse.ap-northeast-2.es.amazonaws.com"
http_auth = ("raguser", "Smileshark12!@")
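
Instead of typing the credentials into a notebook cell, they could be read from environment variables. This is a minimal sketch, assuming hypothetical variable names `OPENSEARCH_USER` and `OPENSEARCH_PASSWORD` that you set yourself before starting the notebook. `load_http_auth` is not part of the original notebook.

```python
import os
from typing import Tuple


def load_http_auth(user_var: str = "OPENSEARCH_USER",
                   password_var: str = "OPENSEARCH_PASSWORD") -> Tuple[str, str]:
    """Return (user, password) from the environment, failing loudly if either is missing."""
    user = os.environ.get(user_var)
    password = os.environ.get(password_var)
    if not user or not password:
        raise RuntimeError(f"Set {user_var} and {password_var} before running this cell.")
    return (user, password)
```

The tuple it returns has the same shape as `http_auth`, so `http_auth = load_http_auth()` could stand in for the hard-coded assignment.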
Install opensearch-py, the OpenSearch client library for Python.
! pip install opensearch-py
Installing collected packages: opensearch-py Successfully installed opensearch-py-2.3.2

Send the vector store data to OpenSearch using the Bulk API.
%%time
logger.info('Loading documents ...')
docs = loader.load()

# add custom metadata fields: a timestamp and the embedding model name
for doc in docs:
    doc.metadata['timestamp'] = time.time()
    doc.metadata['embeddings_model'] = endpoint_name_emb

text_splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=200)
documents = text_splitter.split_documents(docs)

# by default langchain would create a k-NN index and the embeddings would be ingested as a k-NN vector type
docsearch = OpenSearchVectorSearch.from_documents(
    index_name=index_name,
    documents=documents,
    embedding=llm_emb,
    opensearch_url=opensearch_domain_endpoint,
    http_auth=http_auth,
    bulk_size=10000,
    timeout=60
)
text size: 90 _chunk_size: 1 CPU times: user 5.85 s, sys: 273 ms, total: 6.13 s Wall time: 18.8 s

Looking at the fsi-sample index in the OpenSearch dashboard, you can confirm that 90 entries were stored as vector embeddings (the 89 loaded rows became 90 chunks after splitting).

Chaining with 🦜🔗LangChain QnA
from functools import lru_cache
from langchain import PromptTemplate
from langchain.chains.question_answering import load_qa_chain
import copy
import functools
import concurrent.futures
prompt_template = ''.join(["{context}", seperator, "{question}"])
PROMPT = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
chain = load_qa_chain(llm=llm_text, chain_type="stuff", prompt=PROMPT, verbose=True)
vectro_db = OpenSearchVectorSearch(
    index_name=index_name,
    opensearch_url=opensearch_domain_endpoint,
    embedding_function=llm_emb,
    http_auth=http_auth, # http_auth
    is_aoss =False,
    engine="faiss",
    space_type="l2"
)
def pretty_print_documents(response):
    for doc, score in response:
        print(f'\nScore: {score}')
        print(f'Document Number: {doc.metadata["row"]}')
        print(f'Source: {doc.metadata["source"]}')

        # Split the page content into lines
        lines = doc.page_content.split("\n")

        # Print each "key: value" line; split only on the first ": " so the value keeps any later colons
        for line in lines:
            split_line = line.split(": ", 1)
            if len(split_line) > 1:
                print(f'{split_line[0]}: {split_line[1]}')

        print('-' * 50)
def filter_and_remove_score_opensearch_vector_score(res, cutoff_score = 0.006, variance=0.95):
    # Nothing was retrieved, so there is nothing to filter
    if not res:
        return []
    # Get the highest score (with OpenSearch scores, higher means more similar)
    highest_score = max(score for doc, score in res)
    print('highest_score : ', highest_score)
    # If even the best score is below the cutoff, return an empty list
    if highest_score < cutoff_score:
        return []
    # Calculate the lower bound for scores
    lower_bound = highest_score * variance
    print('lower_bound : ', lower_bound)
    # Keep documents scoring at or above the lower bound, and drop the scores
    res = [doc for doc, score in res if score >= lower_bound]

    return res
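
The scores in this notebook are small numbers (roughly 0.003 to 0.007). According to the OpenSearch k-NN documentation, approximate search with the `l2` space type reports `score = 1 / (1 + d)`, where `d` is the squared Euclidean distance. Higher is therefore better, which is why the filter keeps scores at or above a bound. The sketch below only illustrates that conversion; the helper names are hypothetical.

```python
def l2_score(squared_distance: float) -> float:
    """OpenSearch-style score for the l2 space: 1 / (1 + squared distance)."""
    return 1.0 / (1.0 + squared_distance)


def squared_distance_from_score(score: float) -> float:
    """Invert l2_score to recover the squared distance behind a score."""
    return 1.0 / score - 1.0


# The default cutoff_score of 0.006 corresponds to this squared distance
cutoff_distance = squared_distance_from_score(0.006)
print(round(cutoff_distance, 2))
```

Under that formula, raising `cutoff_score` is the same as demanding a smaller distance between the query and the best match.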

def get_similiar_docs(query, k=5, fetch_k=300, score=True, bank=""):

    
    #query = f'{bank}, {query}'
    print (query)
    
    if score:
        pre_similar_doc = vectro_db.similarity_search_with_score(
            query,
            k=k,
            fetch_k=fetch_k,
            search_type="approximate_search", # approximate_search, script_scoring, painless_scripting
            space_type="l2",     #"l2", "l1", "linf", "cosinesimil", "innerproduct", "hammingbit";
            pre_filter={"bool": {"filter": {"term": {"text": bank}}}},
            boolean_filter={"bool": {"filter": {"term": {"text": bank}}}}
            #filter=dict(source=bank)
        )
        #print('jhs : ', similar_docs)
        pretty_print_documents( pre_similar_doc)
        similar_docs=filter_and_remove_score_opensearch_vector_score(pre_similar_doc)        
    else:
        similar_docs = vectro_db.similarity_search(
            query,
            k=k,
            search_type="approximate_search", # approximate_search, script_scoring, painless_scripting
            space_type="l2",     #"l2", "l1", "linf", "cosinesimil", "innerproduct", "hammingbit";
            pre_filter={"bool": {"filter": {"term": {"text": bank}}}},
            boolean_filter={"bool": {"filter": {"term": {"text": bank}}}}
            
        )
    similar_docs_copy = copy.deepcopy(similar_docs)
    
    #print('similar_docs_copy : \\n', similar_docs_copy)
    
    return similar_docs_copy

def get_answer(query, bank="",score=False, fetch_k=300, k=1):
                
    search_query = query
    
    similar_docs = get_similiar_docs(search_query, k=k,score=score, bank=bank)
    

    llm_query = '고객 서비스 센터 직원처럼, '+query+' 카테고리에 대한 Information을 찾아서 설명해주세요.'
    
    if not similar_docs:
        llm_query = query

    answer = chain.run(input_documents=similar_docs, question=llm_query)
    
    return answer
Running the tests
question ='안녕하세요. 날씨가 참 좋네요.'
response = get_answer(question, bank='신한은행',score=True, k=4)
print("챗봇 : ", response)
Score: 0.0033899078 Document Number: 19 Source: 신한은행 no: 70 Category: 홈페이지상에 제가 등록한 칭찬/불만/제안사항 조회할 수 있나요? Information: 로그인 후 등록한 접수내용에 대해서 확인 가능합니다. type: 홈페이지 Source: 신한은행

Score: 0.0032900404 Document Number: 82 Source: 신한은행 no: 7 Category: 인터넷으로 신규 예/적금 신청하는 방법을 알려주세요 Information: 인터넷상으로 예금/신탁을 신규가입하시려면 우선 고객님께서는인터넷뱅킹에 가입하셔야 하며 신규방법은 두 가지가 있습니다.1. 인터넷뱅킹에서 가입인터넷뱅킹 로그인을 하신 후 예금/신탁 > 신규 메뉴에서 예금 및 신탁 상품을 신규하실 수 있습니다.2. 신한S뱅크에서 가입신한S뱅크 상품센터 > 예금센터 메뉴에서 예금상품을 신규하실 수 있습니다. type: 인터넷뱅킹 Source: 신한은행

Score: 0.0032718105 Document Number: 49 Source: 신한은행 no: 40 Category: 회원탈퇴 후 메일이 계속와요. Information: 인터넷뱅킹가입을 하시면 예금/대출/카드 등 거래에 대한 안내(예:예금만기 등)외에 영업점안내메일 등 몇가지 부가서비스가 기본제공됩니다. 부가서비스는 홈페이지에 로그인하셔서(인터넷뱅킹사용자는 별도 회원가입이 필요없습니다.) 이메일서비스의 수신/거부 등 변경을 하시면 됩니다. 다만, 인터넷뱅킹을 해지하시더라도 예금,카드,대출 등의 거래가 남아있을 수 있기에 메일서비스는 계속 제공됩니다. 따라서 고객님의 경우에는 기존에 제공되는 메일서비스가 계속 남아있어 부가서비스 메일을 받으신 것이며. 정보가 유출되거나 하는 경우는 절대 없으니 안심하시기 바랍니다. 더이상 부가서비스 이메일 수신을 원하지 않는 경우에는 홈페이지의 회원가입을 하신 후 회원서비스의 이메일서비스에 가셔서 변경하시면 됩니다. type: Source: 신한은행

Score: 0.0032451365 Document Number: 81 Source: 신한은행 no: 8 Category: 보안메일서비스 안내해줘 Information: 신한은행은 이메일서비스를 통해 고객님의 거래정보와 금융정보를 메일로 알려드리는데 고객님께 발송되는 이메일 중 개인정보보호가 필요한 메일(거래정보 등)은 암호화 처리되어 보안메일로 발송되어 일반메일 (홍보 및 안내메일) 과 구별됩니다. ※ 입출내역 통지서비스는 개인뱅킹 > 뱅킹보안서비스 > 통지서비스 > 입출내역 Email통지서비스 메뉴에서 서비스 신청 및 변경이 가능합니다. type: 인터넷뱅킹 Source: 신한은행

highest_score : 0.0033899078

Entering new StuffDocumentsChain chain...> Entering new LLMChain chain... Prompt after formatting: ||SPEPERATOR||안녕하세요. 날씨가 참 좋네요. prompt 아래는 작업을 설명하는 명령어입니다. 요청을 적절히 완료하는 응답을 작성하세요.

명령어:

안녕하세요. 날씨가 참 좋네요.

응답:

Finished chain.> Finished chain. 챗봇 : 안녕하세요! 오늘은 무엇을 도와드릴까요?
q ='자동이체 서비스는 어떻게 신청해야 하나요?'
response = get_answer(q, bank='신한은행',score=True, k=5)

print("챗봇 : ", response)
자동이체 서비스는 어떻게 신청해야 하나요?

Score: 0.007392953 Document Number: 86 Source: 신한은행 no: 3 Category: 공과금 자동이체 신청이 가능한가요? Information: 인터넷뱅킹에 로그인하신 후 "공과금/법원 > 공과금센터" 페이지에 가시면 "지로자동이체 등록" 메뉴가 있습니다. 해당 메뉴를 통하여 "전기요금, 전화요금, 국민연금, 국민건강보험료 등을 포함하여 각종 지로요금"을 모두 자동이체 등록하실 수 있습니다. type: 인터넷뱅킹 Source: 신한은행

Score: 0.006629911 Document Number: 67 Source: 신한은행 no: 22 Category: 인터넷 예적금 해약하려면 어떻게 해야 하나요? Information: 인터넷에서 신규하셨고, 이후 통장발급을 받지 않으셨다면 인터넷뱅킹(http://bank.shinhan.com)의 금융상품 예금/신탁 해지 메뉴를 통해 해지하실 수 있습니다. type: Source: 신한은행

Score: 0.006477051 Document Number: 53 Source: 신한은행 no: 36 Category: 이메일서비스 신청 및 해지 방법은? Information: 이메일서비스는 인터넷뱅킹의 서비스메뉴 중 뱅킹보안서비스 > 입출내역통지서비스 > 입출내역 E-Mail통지서비스에서 신청 및 해지하실 수 있습니다. type: 인터넷뱅킹 Source: 신한은행

Score: 0.006469134 Document Number: 57 Source: 신한은행 no: 32 Category: 휴대폰통지서비스 신청 방법은? Information: 휴대폰 통지서비스는 본인의 금융거래내역을 거래발생 즉시 등록된 이동통신 단말기로 통지해 주는 서비스입니다. 휴대폰통지서비스 신청방법 ① 개인뱅킹 로그인 ② 뱅킹보안서비스 ③ 입출내역통지서비스 ④ S알리미 서비스와 입출내역 SMS통지서비스 중 선택 휴대폰통지서비스 특징 S알리미 서비스 type: Source: 신한은행

Score: 0.0060997857 Document Number: 58 Source: 신한은행 no: 31 Category: 보안카드 번호 입력은 어떻게 하면 되죠? Information: 씨크리트(보안)카드는 고객님이 은행에서 신한온라인서비스(인터넷뱅킹)에 가입하신 후 받은 카드입니다. 보안카드번호가 필요한 모든 거래에 기타 궁금하신 내용은 신한은행 고객센터 1599-8000로 문의하여 주시기 바랍니다. type: 인터넷뱅킹 Source: 신한은행

highest_score : 0.007392953 lower_bound : 0.007023305349999999

Entering new StuffDocumentsChain chain...> Entering new LLMChain chain... Prompt after formatting: no: 3 Category: 공과금 자동이체 신청이 가능한가요? Information: 인터넷뱅킹에 로그인하신 후 "공과금/법원 > 공과금센터" 페이지에 가시면 "지로자동이체 등록" 메뉴가 있습니다. 해당 메뉴를 통하여 "전기요금, 전화요금, 국민연금, 국민건강보험료 등을 포함하여 각종 지로요금"을 모두 자동이체 등록하실 수 있습니다. type: 인터넷뱅킹 Source: 신한은행||SPEPERATOR||고객 서비스 센터 직원처럼, 자동이체 서비스는 어떻게 신청해야 하나요? 카테고리에 대한 Information을 찾아서 설명해주세요. prompt 아래는 작업을 설명하는 명령어와 추가 컨텍스트를 제공하는 입력이 짝을 이루는 예제입니다. 요청을 적절히 완료하는 응답을 작성하세요.

명령어:

고객 서비스 센터 직원처럼, 자동이체 서비스는 어떻게 신청해야 하나요? 카테고리에 대한 Information을 찾아서 설명해주세요.

입력:

no: 3 Category: 공과금 자동이체 신청이 가능한가요? Information: 인터넷뱅킹에 로그인하신 후 "공과금/법원 > 공과금센터" 페이지에 가시면 "지로자동이체 등록" 메뉴가 있습니다. 해당 메뉴를 통하여 "전기요금, 전화요금, 국민연금, 국민건강보험료 등을 포함하여 각종 지로요금"을 모두 자동이체 등록하실 수 있습니다. type: 인터넷뱅킹 Source: 신한은행

응답:
Finished chain.> Finished chain. 챗봇 : 인터넷뱅킹에 로그인하고 "지로자동이체 등록" 메뉴를 찾은 다음, "전기요금, 전화요금, 국민연금, 국민건강보험료 등을 포함한 각종 지로 요금"을 자동이체로 등록할 수 있습니다. 이 정보는 "은행" 카테고리에서 찾을 수 있으며, "인터넷뱅킹" 카테고리에서도 찾을 수 있습니다.
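
Both runs can be checked offline against the filtering rule, using the scores printed above. In the sketch below, `filter_by_score` is a standalone restatement of the notebook's filter (an illustration, not the notebook code itself), and each document is replaced by its FAQ number.

```python
from typing import List, Tuple


def filter_by_score(results: List[Tuple[str, float]],
                    cutoff_score: float = 0.006,
                    variance: float = 0.95) -> List[str]:
    """Keep items scoring within `variance` of the best score, if the best passes the cutoff."""
    if not results:
        return []
    highest_score = max(score for _, score in results)
    if highest_score < cutoff_score:
        return []
    lower_bound = highest_score * variance
    return [item for item, score in results if score >= lower_bound]


# Scores copied from the two test runs above
greeting_run = [("no: 70", 0.0033899078), ("no: 7", 0.0032900404),
                ("no: 40", 0.0032718105), ("no: 8", 0.0032451365)]
auto_transfer_run = [("no: 3", 0.007392953), ("no: 22", 0.006629911),
                     ("no: 36", 0.006477051), ("no: 32", 0.006469134),
                     ("no: 31", 0.0060997857)]

greeting_kept = filter_by_score(greeting_run)
auto_transfer_kept = filter_by_score(auto_transfer_run)
print(greeting_kept)       # []
print(auto_transfer_kept)  # ['no: 3']
```

The greeting retrieves nothing usable because its best score is below the cutoff, so the model receives the question with no context. For the automatic-transfer question, only FAQ no. 3 clears the lower bound of about 0.00702.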

FAQ with OpenSearch - Vector Store Test

이번에는 Local FAISS 말고 Amazon OpenSearch 서비스를 Vector DB로 사용해보겠습니다.

ragostest.ipynb

OpenSearch 클러스터 생성

DB로 활용될 OpenSearch 클러스터를 생성하도록 하겠습니다.

🔗Amazon OpenSearch Service 로 이동하여 도메인을 생성합니다.

Engine options: OpenSearch_2.9

Network: Public access

도메인이 생성되면, 보안구성 탭으로 이동하여 액세스 정책을 수정합니다.

Effect : Deny → Allow

대시보드 URL과 도메인 엔드포인트를 기록해둡니다.

KULLM 엔드포인트 핸들러 작성
%store -r endpoint_name_emb endpoint_name_text
try:
    endpoint_name_emb
    endpoint_name_text
except NameError:
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] TASK-1, TASK-2 노트북을 다시 실행해 주세요.")
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
import sys
%load_ext autoreload
%autoreload 2
sys.path.append('./utils') # src 폴더 경로 설정
import json
import time
import boto3
import botocore
import numpy as np
from inference_utils import Prompter
from typing import Any, Dict, List, Optional
from langchain.embeddings import SagemakerEndpointEmbeddings
from langchain.llms.sagemaker_endpoint import LLMContentHandler, SagemakerEndpoint
from langchain.embeddings.sagemaker_endpoint import EmbeddingsContentHandler
prompter = Prompter("kullm")
params = {
      'do_sample': False,
      'max_new_tokens': 128,
      'temperature': 1.0,
      'top_k': 0,
      'top_p': 0.9,
      'return_full_text': False,
      'repetition_penalty': 1.1,
      'presence_penalty': None,
      'eos_token_id': 2
}

class KullmContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs={}) -> bytes:
        '''
        입력 데이터 전처리 후에 리턴
        '''
        context, question = prompt.split("||SPEPERATOR||") 
        prompt = prompter.generate_prompt(question, context)

        print ("prompt", prompt)
        payload = {
            'inputs': [prompt],
            'parameters': model_kwargs
        }
                           
        input_str = json.dumps(payload)
        
        return input_str.encode('utf-8')
    

    def transform_output(self, output: bytes) -> str:
        
        response_json = json.loads(output.read().decode("utf-8"))              
        generated_text = response_json[0][0]["generated_text"]
        
        return generated_text
aws_region = boto3.Session().region_name
LLMTextContentHandler = KullmContentHandler()
seperator = "||SPEPERATOR||"

llm_text = SagemakerEndpoint(
    endpoint_name=endpoint_name_text,
    region_name=aws_region,
    model_kwargs=params,    
    content_handler=LLMTextContentHandler,
)
임베딩 모델 엔드포인트 핸들러 작성
class SagemakerEndpointEmbeddingsJumpStart(SagemakerEndpointEmbeddings):
    def embed_documents(self, texts: List[str], chunk_size: int=1) -> List[List[float]]:
        """Compute doc embeddings using a SageMaker Inference Endpoint.

        Args:
            texts: The list of texts to embed.
            chunk_size: The chunk size defines how many input texts will
                be grouped together as request. If None, will use the
                chunk size specified by the class.

        Returns:
            List of embeddings, one for each text.
        """
        results = []
        _chunk_size = len(texts) if chunk_size > len(texts) else chunk_size
        
        print("text size: ", len(texts))
        print("_chunk_size: ", _chunk_size)

        for i in range(0, len(texts), _chunk_size):
            
            #print (i, texts[i : i + _chunk_size])
            response = self._embedding_func(texts[i : i + _chunk_size])
            #print (i, response, len(response[0].shape))
            
            results.extend(response)
        return results
class KoSimCSERobertaContentHandler(EmbeddingsContentHandler):
    
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs={}) -> bytes:
        
        input_str = json.dumps({"inputs": prompt, **model_kwargs})
        
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        
        response_json = json.loads(output.read().decode("utf-8"))
        ndim = np.array(response_json).ndim    
        
        if ndim == 4:
            # Original shape (1, 1, n, 768)
            emb = response_json[0][0][0]
            emb = np.expand_dims(emb, axis=0).tolist()
        elif ndim == 2:
            # Original shape (n, 1)
            emb = []
            for ele in response_json:
                e = ele[0][0]
                emb.append(e)
        else:
            print(f"Other # of dimension: {ndim}")
            emb = None
        return emb
LLMEmbHandler = KoSimCSERobertaContentHandler()

llm_emb = SagemakerEndpointEmbeddingsJumpStart(
    endpoint_name=endpoint_name_emb,
    region_name=aws_region,
    content_handler=LLMEmbHandler,
)
데이터 로드

CSVLoader를 사용하여 faq데이터를 로드합니다.
import json
import boto3
from langchain.document_loaders.csv_loader import CSVLoader
loader = CSVLoader(
    file_path="./dataset/fsi_smart_faq_ko.csv",
    source_column="Source",
    encoding="utf-8"
)
context_documents = loader.load()
len(context_documents), context_documents[5]
(89, Document(page_content='no: 84\nCategory: 기존 공동인증서를 보유한 상태에서 금융인증서 발급이 가능한가요?\nInformation: 공동인증서와 금융인증서는 별개의 인증서로 두 가지 인증서를 모두 사용할 수 있습니다.\ntype: 금융인증서\nSource: 신한은행', metadata={'source': '신한은행', 'row': 5}))

OpenSearch에 Data 인덱싱
import time
import pprint
import logging
import sagemaker
from langchain.vectorstores import OpenSearchVectorSearch
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
# global constants
logger = logging.getLogger()
logging.basicConfig(format='%(asctime)s,%(module)s,%(processName)s,%(levelname)s,%(message)s', level=logging.INFO, stream=sys.stderr)

role = sagemaker.get_execution_role()
role
sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml

'arn:aws:iam::759320821027:role/service-role/AmazonSageMaker-ExecutionRole-20231030T104069'

사용할 인덱스 이름과 도메인 엔드포인트, 로그인 정보를 입력합니다.
index_name = "fsi-sample"
opensearch_domain_endpoint = "<https://search-hmkim-vectordb-z37b25etffjy4udj5xh7cnhsse.ap-northeast-2.es.amazonaws.com>"
http_auth = ("raguser", "Smileshark12!@")
파이썬용 OpenSearch 라이브러리인 opensearch-py를 인스톨합니다.
! pip install opensearch-py
Installing collected packages: opensearch-py Successfully installed opensearch-py-2.3.2

VectorStore 데이터를 OpenSearch 에 Bulk API를 사용하여 전송합니다.
%%time
logger.info('Loading documents ...')
docs = loader.load()

# # add a custom metadata field, such as timestamp
for doc in docs:
    doc.metadata['timestamp'] = time.time()
    doc.metadata['embeddings_model'] = endpoint_name_emb

text_splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=200)
documents = text_splitter.split_documents(docs)

# by default langchain would create a k-NN index and the embeddings would be ingested as a k-NN vector type
docsearch = OpenSearchVectorSearch.from_documents(
    index_name=index_name,
    documents=documents,
    embedding=llm_emb,
    opensearch_url=opensearch_domain_endpoint,
    http_auth=http_auth,
    bulk_size=10000,
    timeout=60
)
text size: 90 _chunk_size: 1 CPU times: user 5.85 s, sys: 273 ms, total: 6.13 s Wall time: 18.8 s

OpenSearch 대쉬보드를 사용하여 fsi-sample 인덱스를 사용해보면, 90개의 데이터가 벡터 임베딩화 되어 저장된 것을 확인할 수 있습니다.

🦜🔗LangChain QnA 사용하여 체이닝하기
from functools import lru_cache
from langchain import PromptTemplate
from langchain.chains.question_answering import load_qa_chain
import copy
import functools
import concurrent.futures
prompt_template = ''.join(["{context}", seperator, "{question}"])
PROMPT = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
chain = load_qa_chain(llm=llm_text, chain_type="stuff", prompt=PROMPT, verbose=True)
vectro_db = OpenSearchVectorSearch(
    index_name=index_name,
    opensearch_url=opensearch_domain_endpoint,
    embedding_function=llm_emb,
    http_auth=http_auth, # http_auth
    is_aoss =False,
    engine="faiss",
    space_type="l2"
)
def pretty_print_documents(response):
    for doc, score in response:
        print(f'\\nScore: {score}')
        print(f'Document Number: {doc.metadata["row"]}')
        print(f'Source: {doc.metadata["source"]}')

        # Split the page content into lines
        lines = doc.page_content.split("\\n")

        # Extract and print each piece of information if it exists
        for line in lines:
            split_line = line.split(": ")
            if len(split_line) > 1:
                print(f'{split_line[0]}: {split_line[1]}')

        print('-' * 50)
def filter_and_remove_score_opensearch_vector_score(res, cutoff_score = 0.006, variance=0.95):
    # Get the lowest score
    highest_score = max(score for doc, score in res)
    print('highest_score : ', highest_score)
    # If the lowest score is over 200, return an empty list
    if highest_score < cutoff_score:
        return []
    # Calculate the upper bound for scores
    lower_bound = highest_score * variance
    print('lower_bound : ', lower_bound)
    # Filter the list and remove the score
    res = [doc for doc, score in res if score >= lower_bound]

    return res

def get_similiar_docs(query, k=5, fetch_k=300, score=True, bank=""):

    
    #query = f'{bank}, {query}'
    print (query)
    
    if score:
        pre_similar_doc = vectro_db.similarity_search_with_score(
            query,
            k=k,
            fetch_k=fetch_k,
            search_type="approximate_search", # approximate_search, script_scoring, painless_scripting
            space_type="l2",     #"l2", "l1", "linf", "cosinesimil", "innerproduct", "hammingbit";
            pre_filter={"bool": {"filter": {"term": {"text": bank}}}},
            boolean_filter={"bool": {"filter": {"term": {"text": bank}}}}
            #filter=dict(source=bank)
        )
        #print('jhs : ', similar_docs)
        pretty_print_documents( pre_similar_doc)
        similar_docs=filter_and_remove_score_opensearch_vector_score(pre_similar_doc)        
    else:
        similar_docs = vectro_db.similarity_search(
            query,
            k=k,
            search_type="approximate_search", # approximate_search, script_scoring, painless_scripting
            space_type="12",     #"l2", "l1", "linf", "cosinesimil", "innerproduct", "hammingbit";
            pre_filter={"bool": {"filter": {"term": {"text": bank}}}},
            boolean_filter={"bool": {"filter": {"term": {"text": bank}}}}
            
        )
    similar_docs_copy = copy.deepcopy(similar_docs)
    
    #print('similar_docs_copy : \\n', similar_docs_copy)
    
    return similar_docs_copy

def get_answer(query, bank="",score=False, fetch_k=300, k=1):
                
    search_query = query
    
    similar_docs = get_similiar_docs(search_query, k=k,score=score, bank=bank)
    

    llm_query = '고객 서비스 센터 직원처럼, '+query+' 카테고리에 대한 Information을 찾아서 설명해주세요.'
    
    if not similar_docs:
        llm_query = query

    answer = chain.run(input_documents=similar_docs, question=llm_query)
    
    return answer
테스트 진행
question ='안녕하세요. 날씨가 참 좋네요.'
response = get_answer(question, bank='신한은행',score=True, k=4)
print("챗봇 : ", response)
Score: 0.0033899078 Document Number: 19 Source: 신한은행 no: 70 Category: 홈페이지상에 제가 등록한 칭찬/불만/제안사항 조회할 수 있나요? Information: 로그인 후 등록한 접수내용에 대해서 확인 가능합니다. type: 홈페이지 Source: 신한은행

Score: 0.0032900404 Document Number: 82 Source: 신한은행 no: 7 Category: 인터넷으로 신규 예/적금 신청하는 방법을 알려주세요 Information: 인터넷상으로 예금/신탁을 신규가입하시려면 우선 고객님께서는인터넷뱅킹에 가입하셔야 하며 신규방법은 두 가지가 있습니다.1. 인터넷뱅킹에서 가입인터넷뱅킹 로그인을 하신 후 예금/신탁 > 신규 메뉴에서 예금 및 신탁 상품을 신규하실 수 있습니다.2. 신한S뱅크에서 가입신한S뱅크 상품센터 > 예금센터 메뉴에서 예금상품을 신규하실 수 있습니다. type: 인터넷뱅킹 Source: 신한은행

Score: 0.0032718105 Document Number: 49 Source: 신한은행 no: 40 Category: 회원탈퇴 후 메일이 계속와요. Information: 인터넷뱅킹가입을 하시면 예금/대출/카드 등 거래에 대한 안내(예:예금만기 등)외에 영업점안내메일 등 몇가지 부가서비스가 기본제공됩니다. 부가서비스는 홈페이지에 로그인하셔서(인터넷뱅킹사용자는 별도 회원가입이 필요없습니다.) 이메일서비스의 수신/거부 등 변경을 하시면 됩니다. 다만, 인터넷뱅킹을 해지하시더라도 예금,카드,대출 등의 거래가 남아있을 수 있기에 메일서비스는 계속 제공됩니다. 따라서 고객님의 경우에는 기존에 제공되는 메일서비스가 계속 남아있어 부가서비스 메일을 받으신 것이며. 정보가 유출되거나 하는 경우는 절대 없으니 안심하시기 바랍니다. 더이상 부가서비스 이메일 수신을 원하지 않는 경우에는 홈페이지의 회원가입을 하신 후 회원서비스의 이메일서비스에 가셔서 변경하시면 됩니다. type: Source: 신한은행

Score: 0.0032451365 Document Number: 81 Source: 신한은행 no: 8 Category: 보안메일서비스 안내해줘 Information: 신한은행은 이메일서비스를 통해 고객님의 거래정보와 금융정보를 메일로 알려드리는데 고객님께 발송되는 이메일 중 개인정보보호가 필요한 메일(거래정보 등)은 암호화 처리되어 보안메일로 발송되어 일반메일 (홍보 및 안내메일) 과 구별됩니다. ※ 입출내역 통지서비스는 개인뱅킹 > 뱅킹보안서비스 > 통지서비스 > 입출내역 Email통지서비스 메뉴에서 서비스 신청 및 변경이 가능합니다. type: 인터넷뱅킹 Source: 신한은행

highest_score : 0.0033899078

Entering new StuffDocumentsChain chain...> Entering new LLMChain chain... Prompt after formatting: ||SPEPERATOR||안녕하세요. 날씨가 참 좋네요. prompt 아래는 작업을 설명하는 명령어입니다. 요청을 적절히 완료하는 응답을 작성하세요.

명령어:

안녕하세요. 날씨가 참 좋네요.

응답:

Finished chain.> Finished chain. 챗봇 : 안녕하세요! 오늘은 무엇을 도와드릴까요?
q ='자동이체 서비스는 어떻게 신청해야 하나요?'
response = get_answer(q, bank='신한은행',score=True, k=5)

print("챗봇 : ", response)
자동이체 서비스는 어떻게 신청해야 하나요?

Score: 0.007392953 Document Number: 86 Source: 신한은행 no: 3 Category: 공과금 자동이체 신청이 가능한가요? Information: 인터넷뱅킹에 로그인하신 후 "공과금/법원 > 공과금센터" 페이지에 가시면 "지로자동이체 등록" 메뉴가 있습니다. 해당 메뉴를 통하여 "전기요금, 전화요금, 국민연금, 국민건강보험료 등을 포함하여 각종 지로요금"을 모두 자동이체 등록하실 수 있습니다. type: 인터넷뱅킹 Source: 신한은행

Score: 0.006629911 Document Number: 67 Source: 신한은행 no: 22 Category: 인터넷 예적금 해약하려면 어떻게 해야 하나요? Information: 인터넷에서 신규하셨고, 이후 통장발급을 받지 않으셨다면 인터넷뱅킹(http://bank.shinhan.com)의 금융상품 예금/신탁 해지 메뉴를 통해 해지하실 수 있습니다. type: Source: 신한은행

Score: 0.006477051 Document Number: 53 Source: 신한은행 no: 36 Category: 이메일서비스 신청 및 해지 방법은? Information: 이메일서비스는 인터넷뱅킹의 서비스메뉴 중 뱅킹보안서비스 > 입출내역통지서비스 > 입출내역 E-Mail통지서비스에서 신청 및 해지하실 수 있습니다. type: 인터넷뱅킹 Source: 신한은행

Score: 0.006469134 Document Number: 57 Source: 신한은행 no: 32 Category: 휴대폰통지서비스 신청 방법은? Information: 휴대폰 통지서비스는 본인의 금융거래내역을 거래발생 즉시 등록된 이동통신 단말기로 통지해 주는 서비스입니다. 휴대폰통지서비스 신청방법 ① 개인뱅킹 로그인 ② 뱅킹보안서비스 ③ 입출내역통지서비스 ④ S알리미 서비스와 입출내역 SMS통지서비스 중 선택 휴대폰통지서비스 특징 S알리미 서비스 type: Source: 신한은행

Score: 0.0060997857 Document Number: 58 Source: 신한은행 no: 31 Category: 보안카드 번호 입력은 어떻게 하면 되죠? Information: 씨크리트(보안)카드는 고객님이 은행에서 신한온라인서비스(인터넷뱅킹)에 가입하신 후 받은 카드입니다. 보안카드번호가 필요한 모든 거래에 기타 궁금하신 내용은 신한은행 고객센터 1599-8000로 문의하여 주시기 바랍니다. type: 인터넷뱅킹 Source: 신한은행

highest_score : 0.007392953 lower_bound : 0.007023305349999999

Entering new StuffDocumentsChain chain...> Entering new LLMChain chain... Prompt after formatting: no: 3 Category: 공과금 자동이체 신청이 가능한가요? Information: 인터넷뱅킹에 로그인하신 후 "공과금/법원 > 공과금센터" 페이지에 가시면 "지로자동이체 등록" 메뉴가 있습니다. 해당 메뉴를 통하여 "전기요금, 전화요금, 국민연금, 국민건강보험료 등을 포함하여 각종 지로요금"을 모두 자동이체 등록하실 수 있습니다. type: 인터넷뱅킹 Source: 신한은행||SPEPERATOR||고객 서비스 센터 직원처럼, 자동이체 서비스는 어떻게 신청해야 하나요? 카테고리에 대한 Information을 찾아서 설명해주세요. prompt 아래는 작업을 설명하는 명령어와 추가 컨텍스트를 제공하는 입력이 짝을 이루는 예제입니다. 요청을 적절히 완료하는 응답을 작성하세요.

명령어:

고객 서비스 센터 직원처럼, 자동이체 서비스는 어떻게 신청해야 하나요? 카테고리에 대한 Information을 찾아서 설명해주세요.

입력:

no: 3 Category: 공과금 자동이체 신청이 가능한가요? Information: 인터넷뱅킹에 로그인하신 후 "공과금/법원 > 공과금센터" 페이지에 가시면 "지로자동이체 등록" 메뉴가 있습니다. 해당 메뉴를 통하여 "전기요금, 전화요금, 국민연금, 국민건강보험료 등을 포함하여 각종 지로요금"을 모두 자동이체 등록하실 수 있습니다. type: 인터넷뱅킹 Source: 신한은행

응답:
Finished chain.> Finished chain. 챗봇 : 인터넷뱅킹에 로그인하고 "지로자동이체 등록" 메뉴를 찾은 다음, "전기요금, 전화요금, 국민연금, 국민건강보험료 등을 포함한 각종 지로 요금"을 자동이체로 등록할 수 있습니다. 이 정보는 "은행" 카테고리에서 찾을 수 있으며, "인터넷뱅킹" 카테고리에서도 찾을 수 있습니다.

FAQ with OpenSearch - Vector Store Test

이번에는 Local FAISS 말고 Amazon OpenSearch 서비스를 Vector DB로 사용해보겠습니다.

ragostest.ipynb

OpenSearch 클러스터 생성

DB로 활용될 OpenSearch 클러스터를 생성하도록 하겠습니다.

🔗Amazon OpenSearch Service 로 이동하여 도메인을 생성합니다.

Engine options: OpenSearch_2.9

Network: Public access

도메인이 생성되면, 보안구성 탭으로 이동하여 액세스 정책을 수정합니다.

Effect : Deny → Allow

대시보드 URL과 도메인 엔드포인트를 기록해둡니다.

KULLM 엔드포인트 핸들러 작성
%store -r endpoint_name_emb endpoint_name_text
try:
    endpoint_name_emb
    endpoint_name_text
except NameError:
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] TASK-1, TASK-2 노트북을 다시 실행해 주세요.")
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
import sys
%load_ext autoreload
%autoreload 2
sys.path.append('./utils') # src 폴더 경로 설정
import json
import time
import boto3
import botocore
import numpy as np
from inference_utils import Prompter
from typing import Any, Dict, List, Optional
from langchain.embeddings import SagemakerEndpointEmbeddings
from langchain.llms.sagemaker_endpoint import LLMContentHandler, SagemakerEndpoint
from langchain.embeddings.sagemaker_endpoint import EmbeddingsContentHandler
prompter = Prompter("kullm")
params = {
      'do_sample': False,
      'max_new_tokens': 128,
      'temperature': 1.0,
      'top_k': 0,
      'top_p': 0.9,
      'return_full_text': False,
      'repetition_penalty': 1.1,
      'presence_penalty': None,
      'eos_token_id': 2
}

class KullmContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs={}) -> bytes:
        '''
        입력 데이터 전처리 후에 리턴
        '''
        context, question = prompt.split("||SPEPERATOR||") 
        prompt = prompter.generate_prompt(question, context)

        print ("prompt", prompt)
        payload = {
            'inputs': [prompt],
            'parameters': model_kwargs
        }
                           
        input_str = json.dumps(payload)
        
        return input_str.encode('utf-8')
    

    def transform_output(self, output: bytes) -> str:
        
        response_json = json.loads(output.read().decode("utf-8"))              
        generated_text = response_json[0][0]["generated_text"]
        
        return generated_text
aws_region = boto3.Session().region_name
LLMTextContentHandler = KullmContentHandler()
seperator = "||SPEPERATOR||"

llm_text = SagemakerEndpoint(
    endpoint_name=endpoint_name_text,
    region_name=aws_region,
    model_kwargs=params,    
    content_handler=LLMTextContentHandler,
)
임베딩 모델 엔드포인트 핸들러 작성
class SagemakerEndpointEmbeddingsJumpStart(SagemakerEndpointEmbeddings):
    def embed_documents(self, texts: List[str], chunk_size: int=1) -> List[List[float]]:
        """Compute doc embeddings using a SageMaker Inference Endpoint.

        Args:
            texts: The list of texts to embed.
            chunk_size: The chunk size defines how many input texts will
                be grouped together as request. If None, will use the
                chunk size specified by the class.

        Returns:
            List of embeddings, one for each text.
        """
        results = []
        _chunk_size = len(texts) if chunk_size > len(texts) else chunk_size
        
        print("text size: ", len(texts))
        print("_chunk_size: ", _chunk_size)

        for i in range(0, len(texts), _chunk_size):
            
            #print (i, texts[i : i + _chunk_size])
            response = self._embedding_func(texts[i : i + _chunk_size])
            #print (i, response, len(response[0].shape))
            
            results.extend(response)
        return results
class KoSimCSERobertaContentHandler(EmbeddingsContentHandler):
    
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs={}) -> bytes:
        
        input_str = json.dumps({"inputs": prompt, **model_kwargs})
        
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> Optional[List[List[float]]]:
        
        response_json = json.loads(output.read().decode("utf-8"))
        ndim = np.array(response_json).ndim    
        
        if ndim == 4:
            # Original shape (1, 1, n, 768)
            emb = response_json[0][0][0]
            emb = np.expand_dims(emb, axis=0).tolist()
        elif ndim == 2:
            # Original shape (n, 1)
            emb = []
            for ele in response_json:
                e = ele[0][0]
                emb.append(e)
        else:
            print(f"Other # of dimension: {ndim}")
            emb = None
        return emb
LLMEmbHandler = KoSimCSERobertaContentHandler()

llm_emb = SagemakerEndpointEmbeddingsJumpStart(
    endpoint_name=endpoint_name_emb,
    region_name=aws_region,
    content_handler=LLMEmbHandler,
)
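transform_output keeps one vector per input text. Going by the shape comments in the handler, a single text comes back nested as (1, 1, n, 768) with n token vectors, and the code keeps the first of them. The sketch below shows that selection in plain Python with a toy dimension of 3 instead of 768 and made-up numbers. `first_token_vectors` is a hypothetical helper, not part of the notebook.

```python
from typing import List

def first_token_vectors(response_json: list) -> List[List[float]]:
    """Keep the first token vector of every text in the response."""
    # Each element is assumed to look like [[tok_0, tok_1, ...]] for one text.
    return [per_text[0][0] for per_text in response_json]

# One text with two token vectors (toy dimension 3 instead of 768).
single = [[[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]]]

# Two texts with different token counts, so the nesting is ragged.
batch = [
    [[[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]]],
    [[[3.0, 3.0, 3.0]]],
]

single_emb = first_token_vectors(single)
batch_emb = first_token_vectors(batch)
print(single_emb)
print(batch_emb)
```

Under these assumed shapes the same indexing covers both the single-text and the batched case, which is what the two branches of the handler do separately.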
Loading the data

Load the FAQ data with CSVLoader.
import json
import boto3
from langchain.document_loaders.csv_loader import CSVLoader
loader = CSVLoader(
    file_path="./dataset/fsi_smart_faq_ko.csv",
    source_column="Source",
    encoding="utf-8"
)
context_documents = loader.load()
len(context_documents), context_documents[5]
(89, Document(page_content='no: 84\nCategory: 기존 공동인증서를 보유한 상태에서 금융인증서 발급이 가능한가요?\nInformation: 공동인증서와 금융인증서는 별개의 인증서로 두 가지 인증서를 모두 사용할 수 있습니다.\ntype: 금융인증서\nSource: 신한은행', metadata={'source': '신한은행', 'row': 5}))
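The output above shows what each loaded row looks like: one `column: value` line per CSV column in page_content, plus a metadata dict holding the source column value and the row index. A rough standard-library sketch of that transformation follows. `rows_to_documents` and the sample rows are hypothetical, and plain dicts stand in for LangChain's Document class.

```python
import csv
import io

def rows_to_documents(csv_text: str, source_column: str) -> list:
    """Turn each CSV row into a dict with page_content and metadata."""
    documents = []
    for row_index, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        # One "column: value" line per column, in header order.
        page_content = "\n".join(f"{key}: {value}" for key, value in row.items())
        metadata = {"source": row[source_column], "row": row_index}
        documents.append({"page_content": page_content, "metadata": metadata})
    return documents

# Made-up rows using the same column names as the FAQ file.
sample_csv = "no,Category,Information,type,Source\n1,Q1,A1,T1,BankA\n2,Q2,A2,T2,BankB\n"
docs_sketch = rows_to_documents(sample_csv, source_column="Source")
print(docs_sketch[1])
```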

Indexing data into OpenSearch
import time
import pprint
import logging
import sagemaker
from langchain.vectorstores import OpenSearchVectorSearch
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
# global constants
logger = logging.getLogger()
logging.basicConfig(format='%(asctime)s,%(module)s,%(processName)s,%(levelname)s,%(message)s', level=logging.INFO, stream=sys.stderr)

role = sagemaker.get_execution_role()
role
sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml

'arn:aws:iam::759320821027:role/service-role/AmazonSageMaker-ExecutionRole-20231030T104069'

Enter the index name, the domain endpoint, and the login credentials to use.
index_name = "fsi-sample"
opensearch_domain_endpoint = "https://search-hmkim-vectordb-z37b25etffjy4udj5xh7cnhsse.ap-northeast-2.es.amazonaws.com"
http_auth = ("raguser", "Smileshark12!@")
Install opensearch-py, the OpenSearch client library for Python.
! pip install opensearch-py
Installing collected packages: opensearch-py
Successfully installed opensearch-py-2.3.2

Send the vector store data to OpenSearch using the Bulk API.
%%time
logger.info('Loading documents ...')
docs = loader.load()

# add custom metadata fields: a timestamp and the embedding model name
for doc in docs:
    doc.metadata['timestamp'] = time.time()
    doc.metadata['embeddings_model'] = endpoint_name_emb

text_splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=200)
documents = text_splitter.split_documents(docs)

# by default langchain would create a k-NN index and the embeddings would be ingested as a k-NN vector type
docsearch = OpenSearchVectorSearch.from_documents(
    index_name=index_name,
    documents=documents,
    embedding=llm_emb,
    opensearch_url=opensearch_domain_endpoint,
    http_auth=http_auth,
    bulk_size=10000,
    timeout=60
)
text size: 90
_chunk_size: 1
CPU times: user 5.85 s, sys: 273 ms, total: 6.13 s
Wall time: 18.8 s
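The notebook never writes a Bulk request by hand; from_documents is assumed here to go through the opensearch-py bulk helper. To make the wire format concrete, the sketch below builds a Bulk API body as newline-delimited JSON, with an action line before each document. The field names `vector_field`, `text`, and `metadata` are assumed to match LangChain's defaults, the vectors are made up, and `to_bulk_ndjson` is a hypothetical helper.

```python
import json

def to_bulk_ndjson(index_name: str, records: list) -> str:
    """Build a Bulk API body: one action line, then one document line, per record."""
    lines = []
    for record in records:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(record, ensure_ascii=False))
    # The Bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"

# Toy records with 2-dimensional vectors instead of real embeddings.
records = [
    {"vector_field": [0.1, 0.2], "text": "chunk one", "metadata": {"row": 0}},
    {"vector_field": [0.3, 0.4], "text": "chunk two", "metadata": {"row": 1}},
]
body = to_bulk_ndjson("fsi-sample", records)
print(body)
```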

If you look up the fsi-sample index in OpenSearch Dashboards, you can confirm that 90 documents have been stored as vector embeddings.

🦜🔗 Chaining with LangChain QnA
from functools import lru_cache
from langchain import PromptTemplate
from langchain.chains.question_answering import load_qa_chain
import copy
import functools
import concurrent.futures
prompt_template = ''.join(["{context}", seperator, "{question}"])
PROMPT = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
chain = load_qa_chain(llm=llm_text, chain_type="stuff", prompt=PROMPT, verbose=True)
vectro_db = OpenSearchVectorSearch(
    index_name=index_name,
    opensearch_url=opensearch_domain_endpoint,
    embedding_function=llm_emb,
    http_auth=http_auth, # http_auth
    is_aoss =False,
    engine="faiss",
    space_type="l2"
)
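The scores printed later in this notebook are small numbers such as 0.007 rather than raw distances. Assuming OpenSearch k-NN's documented scoring for the l2 space, where the score is 1 / (1 + d) and d is the squared Euclidean distance, a score can be converted back to a distance as sketched below. The two helper functions are hypothetical; the 0.006 cutoff and the 0.007392953 score are taken from this notebook.

```python
def l2_score_from_distance(squared_l2: float) -> float:
    """Score reported for a hit at the given squared L2 distance (assumed formula)."""
    return 1.0 / (1.0 + squared_l2)

def squared_l2_from_score(score: float) -> float:
    """Invert the score back to a squared L2 distance."""
    return 1.0 / score - 1.0

# Distance that corresponds to the notebook's cutoff score of 0.006.
cutoff_distance = squared_l2_from_score(0.006)
# Distance of the best hit in the automatic-transfer test query.
best_hit_distance = squared_l2_from_score(0.007392953)

print(round(cutoff_distance, 2))
print(round(best_hit_distance, 2))
```

Under that assumption, a higher score means a closer vector, which is why the filter in this notebook keeps the documents with the largest scores.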
def pretty_print_documents(response):
    for doc, score in response:
        print(f'\nScore: {score}')
        print(f'Document Number: {doc.metadata["row"]}')
        print(f'Source: {doc.metadata["source"]}')

        # Split the page content into lines
        lines = doc.page_content.split("\n")

        # Print each "key: value" line; split on the first ": " only so values keep later colons
        for line in lines:
            split_line = line.split(": ", 1)
            if len(split_line) > 1:
                print(f'{split_line[0]}: {split_line[1]}')

        print('-' * 50)
def filter_and_remove_score_opensearch_vector_score(res, cutoff_score = 0.006, variance=0.95):
    # Nothing was retrieved, so there is nothing to filter
    if not res:
        return []
    # Get the highest score
    highest_score = max(score for doc, score in res)
    print('highest_score : ', highest_score)
    # If even the best score is below the cutoff, return an empty list
    if highest_score < cutoff_score:
        return []
    # Calculate the lower bound for scores
    lower_bound = highest_score * variance
    print('lower_bound : ', lower_bound)
    # Keep only documents scoring at or above the lower bound, and drop the scores
    res = [doc for doc, score in res if score >= lower_bound]

    return res
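To see what the filter does with real numbers, the sketch below applies the same rule to the scores printed by the two test queries in this notebook. `keep_near_best` is a hypothetical re-implementation without the prints; its empty-input guard is an addition, and the document labels are placeholders for the Document objects.

```python
def keep_near_best(results, cutoff_score=0.006, variance=0.95):
    """Keep documents whose score is within (1 - variance) of the best score."""
    if not results:
        return []
    highest_score = max(score for _, score in results)
    # Reject everything when even the best hit is too weak.
    if highest_score < cutoff_score:
        return []
    lower_bound = highest_score * variance
    return [doc for doc, score in results if score >= lower_bound]

# Scores from the greeting query: the best one is under the cutoff.
greeting_hits = [("doc 19", 0.0033899078), ("doc 82", 0.0032900404),
                 ("doc 49", 0.0032718105), ("doc 81", 0.0032451365)]

# Scores from the automatic-transfer query: only the top hit is within 5% of the best.
transfer_hits = [("doc 86", 0.007392953), ("doc 67", 0.006629911),
                 ("doc 53", 0.006477051), ("doc 57", 0.006469134),
                 ("doc 58", 0.0060997857)]

greeting_kept = keep_near_best(greeting_hits)
transfer_kept = keep_near_best(transfer_hits)
print(greeting_kept)
print(transfer_kept)
```

The greeting query ends up with no context at all, so the model answers it as small talk, while the transfer query passes a single FAQ entry to the model.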

def get_similiar_docs(query, k=5, fetch_k=300, score=True, bank=""):

    
    #query = f'{bank}, {query}'
    print (query)
    
    if score:
        pre_similar_doc = vectro_db.similarity_search_with_score(
            query,
            k=k,
            fetch_k=fetch_k,
            search_type="approximate_search", # approximate_search, script_scoring, painless_scripting
            space_type="l2",     #"l2", "l1", "linf", "cosinesimil", "innerproduct", "hammingbit";
            pre_filter={"bool": {"filter": {"term": {"text": bank}}}},
            boolean_filter={"bool": {"filter": {"term": {"text": bank}}}}
            #filter=dict(source=bank)
        )
        #print('jhs : ', similar_docs)
        pretty_print_documents( pre_similar_doc)
        similar_docs=filter_and_remove_score_opensearch_vector_score(pre_similar_doc)        
    else:
        similar_docs = vectro_db.similarity_search(
            query,
            k=k,
            search_type="approximate_search", # approximate_search, script_scoring, painless_scripting
            space_type="l2",     #"l2", "l1", "linf", "cosinesimil", "innerproduct", "hammingbit";
            pre_filter={"bool": {"filter": {"term": {"text": bank}}}},
            boolean_filter={"bool": {"filter": {"term": {"text": bank}}}}
            
        )
    similar_docs_copy = copy.deepcopy(similar_docs)
    
    #print('similar_docs_copy : \\n', similar_docs_copy)
    
    return similar_docs_copy

def get_answer(query, bank="",score=False, fetch_k=300, k=1):
                
    search_query = query
    
    similar_docs = get_similiar_docs(search_query, k=k, fetch_k=fetch_k, score=score, bank=bank)
    

    llm_query = '고객 서비스 센터 직원처럼, '+query+' 카테고리에 대한 Information을 찾아서 설명해주세요.'
    
    if not similar_docs:
        llm_query = query

    answer = chain.run(input_documents=similar_docs, question=llm_query)
    
    return answer
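get_answer hands the retrieved documents and the query to the chain, which fills the `{context}` + separator + `{question}` template defined earlier. The sketch below reproduces only that string assembly in plain Python so the two cases are easy to compare. It assumes the stuff chain joins documents with a blank line (taken here to be LangChain's default document separator), and `pack_prompt` is a hypothetical helper.

```python
SEPARATOR = "||SPEPERATOR||"  # same literal the notebook uses

def pack_prompt(page_contents: list, query: str) -> str:
    """Assemble the string that the content handler later splits on SEPARATOR."""
    if page_contents:
        # Retrieval succeeded: wrap the query in the customer-service instruction.
        question = "고객 서비스 센터 직원처럼, " + query + " 카테고리에 대한 Information을 찾아서 설명해주세요."
    else:
        # Nothing passed the score filter: send the raw query with empty context.
        question = query
    context = "\n\n".join(page_contents)
    return context + SEPARATOR + question

empty_case = pack_prompt([], "안녕하세요.")
hit_case = pack_prompt(["doc A", "doc B"], "자동이체")
print(empty_case)
print(hit_case)
```

The empty case matches the `||SPEPERATOR||...` prompt visible in the greeting test log: the separator is still present, so the split in transform_input yields an empty context.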
Running the tests
question ='안녕하세요. 날씨가 참 좋네요.'  # "Hello. The weather is really nice."
response = get_answer(question, bank='신한은행',score=True, k=4)
print("챗봇 : ", response)
Score: 0.0033899078 Document Number: 19 Source: 신한은행 no: 70 Category: 홈페이지상에 제가 등록한 칭찬/불만/제안사항 조회할 수 있나요? Information: 로그인 후 등록한 접수내용에 대해서 확인 가능합니다. type: 홈페이지 Source: 신한은행

Score: 0.0032900404 Document Number: 82 Source: 신한은행 no: 7 Category: 인터넷으로 신규 예/적금 신청하는 방법을 알려주세요 Information: 인터넷상으로 예금/신탁을 신규가입하시려면 우선 고객님께서는인터넷뱅킹에 가입하셔야 하며 신규방법은 두 가지가 있습니다.1. 인터넷뱅킹에서 가입인터넷뱅킹 로그인을 하신 후 예금/신탁 > 신규 메뉴에서 예금 및 신탁 상품을 신규하실 수 있습니다.2. 신한S뱅크에서 가입신한S뱅크 상품센터 > 예금센터 메뉴에서 예금상품을 신규하실 수 있습니다. type: 인터넷뱅킹 Source: 신한은행

Score: 0.0032718105 Document Number: 49 Source: 신한은행 no: 40 Category: 회원탈퇴 후 메일이 계속와요. Information: 인터넷뱅킹가입을 하시면 예금/대출/카드 등 거래에 대한 안내(예:예금만기 등)외에 영업점안내메일 등 몇가지 부가서비스가 기본제공됩니다. 부가서비스는 홈페이지에 로그인하셔서(인터넷뱅킹사용자는 별도 회원가입이 필요없습니다.) 이메일서비스의 수신/거부 등 변경을 하시면 됩니다. 다만, 인터넷뱅킹을 해지하시더라도 예금,카드,대출 등의 거래가 남아있을 수 있기에 메일서비스는 계속 제공됩니다. 따라서 고객님의 경우에는 기존에 제공되는 메일서비스가 계속 남아있어 부가서비스 메일을 받으신 것이며. 정보가 유출되거나 하는 경우는 절대 없으니 안심하시기 바랍니다. 더이상 부가서비스 이메일 수신을 원하지 않는 경우에는 홈페이지의 회원가입을 하신 후 회원서비스의 이메일서비스에 가셔서 변경하시면 됩니다. type: Source: 신한은행

Score: 0.0032451365 Document Number: 81 Source: 신한은행 no: 8 Category: 보안메일서비스 안내해줘 Information: 신한은행은 이메일서비스를 통해 고객님의 거래정보와 금융정보를 메일로 알려드리는데 고객님께 발송되는 이메일 중 개인정보보호가 필요한 메일(거래정보 등)은 암호화 처리되어 보안메일로 발송되어 일반메일 (홍보 및 안내메일) 과 구별됩니다. ※ 입출내역 통지서비스는 개인뱅킹 > 뱅킹보안서비스 > 통지서비스 > 입출내역 Email통지서비스 메뉴에서 서비스 신청 및 변경이 가능합니다. type: 인터넷뱅킹 Source: 신한은행

highest_score : 0.0033899078

> Entering new StuffDocumentsChain chain...

> Entering new LLMChain chain...
Prompt after formatting:
||SPEPERATOR||안녕하세요. 날씨가 참 좋네요.
prompt 아래는 작업을 설명하는 명령어입니다. 요청을 적절히 완료하는 응답을 작성하세요.

명령어:

안녕하세요. 날씨가 참 좋네요.

응답:

> Finished chain.

> Finished chain.
챗봇 : 안녕하세요! 오늘은 무엇을 도와드릴까요?
q ='자동이체 서비스는 어떻게 신청해야 하나요?'  # "How do I sign up for the automatic transfer service?"
response = get_answer(q, bank='신한은행',score=True, k=5)

print("챗봇 : ", response)
자동이체 서비스는 어떻게 신청해야 하나요?

Score: 0.007392953 Document Number: 86 Source: 신한은행 no: 3 Category: 공과금 자동이체 신청이 가능한가요? Information: 인터넷뱅킹에 로그인하신 후 "공과금/법원 > 공과금센터" 페이지에 가시면 "지로자동이체 등록" 메뉴가 있습니다. 해당 메뉴를 통하여 "전기요금, 전화요금, 국민연금, 국민건강보험료 등을 포함하여 각종 지로요금"을 모두 자동이체 등록하실 수 있습니다. type: 인터넷뱅킹 Source: 신한은행

Score: 0.006629911 Document Number: 67 Source: 신한은행 no: 22 Category: 인터넷 예적금 해약하려면 어떻게 해야 하나요? Information: 인터넷에서 신규하셨고, 이후 통장발급을 받지 않으셨다면 인터넷뱅킹(http://bank.shinhan.com)의 금융상품 예금/신탁 해지 메뉴를 통해 해지하실 수 있습니다. type: Source: 신한은행

Score: 0.006477051 Document Number: 53 Source: 신한은행 no: 36 Category: 이메일서비스 신청 및 해지 방법은? Information: 이메일서비스는 인터넷뱅킹의 서비스메뉴 중 뱅킹보안서비스 > 입출내역통지서비스 > 입출내역 E-Mail통지서비스에서 신청 및 해지하실 수 있습니다. type: 인터넷뱅킹 Source: 신한은행

Score: 0.006469134 Document Number: 57 Source: 신한은행 no: 32 Category: 휴대폰통지서비스 신청 방법은? Information: 휴대폰 통지서비스는 본인의 금융거래내역을 거래발생 즉시 등록된 이동통신 단말기로 통지해 주는 서비스입니다. 휴대폰통지서비스 신청방법 ① 개인뱅킹 로그인 ② 뱅킹보안서비스 ③ 입출내역통지서비스 ④ S알리미 서비스와 입출내역 SMS통지서비스 중 선택 휴대폰통지서비스 특징 S알리미 서비스 type: Source: 신한은행

Score: 0.0060997857 Document Number: 58 Source: 신한은행 no: 31 Category: 보안카드 번호 입력은 어떻게 하면 되죠? Information: 씨크리트(보안)카드는 고객님이 은행에서 신한온라인서비스(인터넷뱅킹)에 가입하신 후 받은 카드입니다. 보안카드번호가 필요한 모든 거래에 기타 궁금하신 내용은 신한은행 고객센터 1599-8000로 문의하여 주시기 바랍니다. type: 인터넷뱅킹 Source: 신한은행

highest_score : 0.007392953
lower_bound : 0.007023305349999999

> Entering new StuffDocumentsChain chain...

> Entering new LLMChain chain...
Prompt after formatting:
no: 3 Category: 공과금 자동이체 신청이 가능한가요? Information: 인터넷뱅킹에 로그인하신 후 "공과금/법원 > 공과금센터" 페이지에 가시면 "지로자동이체 등록" 메뉴가 있습니다. 해당 메뉴를 통하여 "전기요금, 전화요금, 국민연금, 국민건강보험료 등을 포함하여 각종 지로요금"을 모두 자동이체 등록하실 수 있습니다. type: 인터넷뱅킹 Source: 신한은행||SPEPERATOR||고객 서비스 센터 직원처럼, 자동이체 서비스는 어떻게 신청해야 하나요? 카테고리에 대한 Information을 찾아서 설명해주세요.
prompt 아래는 작업을 설명하는 명령어와 추가 컨텍스트를 제공하는 입력이 짝을 이루는 예제입니다. 요청을 적절히 완료하는 응답을 작성하세요.

명령어:

고객 서비스 센터 직원처럼, 자동이체 서비스는 어떻게 신청해야 하나요? 카테고리에 대한 Information을 찾아서 설명해주세요.

입력:

no: 3 Category: 공과금 자동이체 신청이 가능한가요? Information: 인터넷뱅킹에 로그인하신 후 "공과금/법원 > 공과금센터" 페이지에 가시면 "지로자동이체 등록" 메뉴가 있습니다. 해당 메뉴를 통하여 "전기요금, 전화요금, 국민연금, 국민건강보험료 등을 포함하여 각종 지로요금"을 모두 자동이체 등록하실 수 있습니다. type: 인터넷뱅킹 Source: 신한은행

응답:
> Finished chain.

> Finished chain.
챗봇 : 인터넷뱅킹에 로그인하고 "지로자동이체 등록" 메뉴를 찾은 다음, "전기요금, 전화요금, 국민연금, 국민건강보험료 등을 포함한 각종 지로 요금"을 자동이체로 등록할 수 있습니다. 이 정보는 "은행" 카테고리에서 찾을 수 있으며, "인터넷뱅킹" 카테고리에서도 찾을 수 있습니다.

FAQ with OpenSearch - Vector Store Test

이번에는 Local FAISS 말고 Amazon OpenSearch 서비스를 Vector DB로 사용해보겠습니다.

ragostest.ipynb

OpenSearch 클러스터 생성

DB로 활용될 OpenSearch 클러스터를 생성하도록 하겠습니다.

🔗Amazon OpenSearch Service 로 이동하여 도메인을 생성합니다.

Engine options: OpenSearch_2.9

Network: Public access

도메인이 생성되면, 보안구성 탭으로 이동하여 액세스 정책을 수정합니다.

Effect : Deny → Allow

대시보드 URL과 도메인 엔드포인트를 기록해둡니다.

KULLM 엔드포인트 핸들러 작성
%store -r endpoint_name_emb endpoint_name_text
try:
    endpoint_name_emb
    endpoint_name_text
except NameError:
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
    print("[ERROR] TASK-1, TASK-2 노트북을 다시 실행해 주세요.")
    print("++++++++++++++++++++++++++++++++++++++++++++++++++++++++")
import sys
%load_ext autoreload
%autoreload 2
sys.path.append('./utils') # src 폴더 경로 설정
import json
import time
import boto3
import botocore
import numpy as np
from inference_utils import Prompter
from typing import Any, Dict, List, Optional
from langchain.embeddings import SagemakerEndpointEmbeddings
from langchain.llms.sagemaker_endpoint import LLMContentHandler, SagemakerEndpoint
from langchain.embeddings.sagemaker_endpoint import EmbeddingsContentHandler
prompter = Prompter("kullm")
params = {
      'do_sample': False,
      'max_new_tokens': 128,
      'temperature': 1.0,
      'top_k': 0,
      'top_p': 0.9,
      'return_full_text': False,
      'repetition_penalty': 1.1,
      'presence_penalty': None,
      'eos_token_id': 2
}

class KullmContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs={}) -> bytes:
        '''
        입력 데이터 전처리 후에 리턴
        '''
        context, question = prompt.split("||SPEPERATOR||") 
        prompt = prompter.generate_prompt(question, context)

        print ("prompt", prompt)
        payload = {
            'inputs': [prompt],
            'parameters': model_kwargs
        }
                           
        input_str = json.dumps(payload)
        
        return input_str.encode('utf-8')
    

    def transform_output(self, output: bytes) -> str:
        
        response_json = json.loads(output.read().decode("utf-8"))              
        generated_text = response_json[0][0]["generated_text"]
        
        return generated_text
aws_region = boto3.Session().region_name
LLMTextContentHandler = KullmContentHandler()
seperator = "||SPEPERATOR||"

llm_text = SagemakerEndpoint(
    endpoint_name=endpoint_name_text,
    region_name=aws_region,
    model_kwargs=params,    
    content_handler=LLMTextContentHandler,
)
임베딩 모델 엔드포인트 핸들러 작성
class SagemakerEndpointEmbeddingsJumpStart(SagemakerEndpointEmbeddings):
    def embed_documents(self, texts: List[str], chunk_size: int=1) -> List[List[float]]:
        """Compute doc embeddings using a SageMaker Inference Endpoint.

        Args:
            texts: The list of texts to embed.
            chunk_size: The chunk size defines how many input texts will
                be grouped together as request. If None, will use the
                chunk size specified by the class.

        Returns:
            List of embeddings, one for each text.
        """
        results = []
        _chunk_size = len(texts) if chunk_size > len(texts) else chunk_size
        
        print("text size: ", len(texts))
        print("_chunk_size: ", _chunk_size)

        for i in range(0, len(texts), _chunk_size):
            
            #print (i, texts[i : i + _chunk_size])
            response = self._embedding_func(texts[i : i + _chunk_size])
            #print (i, response, len(response[0].shape))
            
            results.extend(response)
        return results
class KoSimCSERobertaContentHandler(EmbeddingsContentHandler):
    
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: List[str], model_kwargs: Dict = {}) -> bytes:
        
        input_str = json.dumps({"inputs": prompt, **model_kwargs})
        
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> List[List[float]]:
        
        response_json = json.loads(output.read().decode("utf-8"))
        ndim = np.array(response_json).ndim    
        
        if ndim == 4:
            # Original shape (1, 1, n, 768)
            emb = response_json[0][0][0]
            emb = np.expand_dims(emb, axis=0).tolist()
        elif ndim == 2:
            # Original shape (n, 1)
            emb = []
            for ele in response_json:
                e = ele[0][0]
                emb.append(e)
        else:
            print(f"Other # of dimension: {ndim}")
            emb = None
        return emb
LLMEmbHandler = KoSimCSERobertaContentHandler()

llm_emb = SagemakerEndpointEmbeddingsJumpStart(
    endpoint_name=endpoint_name_emb,
    region_name=aws_region,
    content_handler=LLMEmbHandler,
)
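The batching loop in embed_documents can be tried without an endpoint. The sketch below is a simplified stand-in: embed_in_chunks and fake_embed are hypothetical names, the fake embedding is just the text length, and the guard against an empty input list is an addition that the class above does not have.

```python
from typing import Callable, List


def embed_in_chunks(
    texts: List[str],
    chunk_size: int,
    embed_fn: Callable[[List[str]], List[List[float]]],
) -> List[List[float]]:
    """Send texts to embed_fn in groups of at most chunk_size."""
    results: List[List[float]] = []
    # Clamp the chunk size to the input length; keep it at least 1 so
    # range() never receives a zero step (an added safeguard).
    size = max(1, min(chunk_size, len(texts)))
    for i in range(0, len(texts), size):
        results.extend(embed_fn(texts[i : i + size]))
    return results


batch_sizes: List[int] = []


def fake_embed(batch: List[str]) -> List[List[float]]:
    """Hypothetical embedding function: one 1-dimensional vector per text."""
    batch_sizes.append(len(batch))
    return [[float(len(text))] for text in batch]


vectors = embed_in_chunks(["a", "bb", "ccc", "dddd", "eeeee"], 2, fake_embed)
print(batch_sizes, vectors)
```

With chunk_size=1, as in the class default, every text becomes its own request, which is simple but means one endpoint call per document.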
Loading the data

Load the FAQ data with CSVLoader.
import json
import boto3
from langchain.document_loaders.csv_loader import CSVLoader
loader = CSVLoader(
    file_path="./dataset/fsi_smart_faq_ko.csv",
    source_column="Source",
    encoding="utf-8"
)
context_documents = loader.load()
len(context_documents), context_documents[5]
(89, Document(page_content='no: 84\nCategory: 기존 공동인증서를 보유한 상태에서 금융인증서 발급이 가능한가요?\nInformation: 공동인증서와 금융인증서는 별개의 인증서로 두 가지 인증서를 모두 사용할 수 있습니다.\ntype: 금융인증서\nSource: 신한은행', metadata={'source': '신한은행', 'row': 5}))
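The page_content shown above is each CSV row flattened into "column: value" lines, with the Source column copied into the metadata. The following sketch reproduces that layout with the standard library only. It is an approximation of what the loader produces rather than the library's own code, and the sample row and the row_to_document helper are made up for illustration.

```python
import csv
import io

# Hypothetical one-row CSV with the same columns as the FAQ dataset.
sample_csv = (
    "no,Category,Information,type,Source\n"
    "84,Question text,Answer text,cert,BankA\n"
)


def row_to_document(row: dict, row_index: int, source_column: str):
    """Flatten one CSV row into (page_content, metadata)."""
    content = "\n".join(f"{key.strip()}: {value.strip()}" for key, value in row.items())
    metadata = {"source": row[source_column], "row": row_index}
    return content, metadata


reader = csv.DictReader(io.StringIO(sample_csv))
loaded = [row_to_document(row, i, "Source") for i, row in enumerate(reader)]
print(loaded[0][0])
print(loaded[0][1])
```

Because the whole row ends up in page_content, the bank name is part of the embedded text as well as the metadata.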

Indexing data into OpenSearch
import time
import pprint
import logging
import sagemaker
from langchain.vectorstores import OpenSearchVectorSearch
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
# global constants
logger = logging.getLogger()
logging.basicConfig(format='%(asctime)s,%(module)s,%(processName)s,%(levelname)s,%(message)s', level=logging.INFO, stream=sys.stderr)

role = sagemaker.get_execution_role()
role
sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml

'arn:aws:iam::759320821027:role/service-role/AmazonSageMaker-ExecutionRole-20231030T104069'

Enter the index name, domain endpoint, and login credentials to use.
index_name = "fsi-sample"
opensearch_domain_endpoint = "https://search-hmkim-vectordb-z37b25etffjy4udj5xh7cnhsse.ap-northeast-2.es.amazonaws.com"
http_auth = ("raguser", "Smileshark12!@")
Install opensearch-py, the OpenSearch client library for Python.
! pip install opensearch-py
Installing collected packages: opensearch-py Successfully installed opensearch-py-2.3.2

Send the vector store data to OpenSearch using the Bulk API.
%%time
logger.info('Loading documents ...')
docs = loader.load()

# add custom metadata fields: a timestamp and the embedding model name
for doc in docs:
    doc.metadata['timestamp'] = time.time()
    doc.metadata['embeddings_model'] = endpoint_name_emb

text_splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=200)
documents = text_splitter.split_documents(docs)

# by default langchain would create a k-NN index and the embeddings would be ingested as a k-NN vector type
docsearch = OpenSearchVectorSearch.from_documents(
    index_name=index_name,
    documents=documents,
    embedding=llm_emb,
    opensearch_url=opensearch_domain_endpoint,
    http_auth=http_auth,
    bulk_size=10000,
    timeout=60
)
text size: 90
_chunk_size: 1
CPU times: user 5.85 s, sys: 273 ms, total: 6.13 s
Wall time: 18.8 s

If you inspect the fsi-sample index in OpenSearch Dashboards, you can confirm that 90 documents were stored as vector embeddings.
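For readers who have not used the Bulk API directly, the request body is newline-delimited JSON in which each document is preceded by an action line. The sketch below builds such a body by hand. The field names vector_field, text, and metadata are assumed to match the vector store's defaults, and to_bulk_body and the sample values are hypothetical. In the notebook this body is assembled for you by from_documents.

```python
import json
from typing import Dict, List


def to_bulk_body(
    index_name: str,
    texts: List[str],
    vectors: List[List[float]],
    metadatas: List[Dict],
) -> str:
    """Build a newline-delimited Bulk API body: action line, then source line."""
    lines = []
    for i, (text, vector, metadata) in enumerate(zip(texts, vectors, metadatas)):
        # Action line: index this document into index_name with an explicit id.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": str(i)}}))
        # Source line: field names are assumptions, not confirmed by this post.
        lines.append(
            json.dumps(
                {"vector_field": vector, "text": text, "metadata": metadata},
                ensure_ascii=False,
            )
        )
    # The Bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


body = to_bulk_body(
    "fsi-sample",
    ["text a", "text b"],
    [[0.1, 0.2], [0.3, 0.4]],
    [{"source": "BankA", "row": 0}, {"source": "BankA", "row": 1}],
)
print(body)
```

The bulk_size=10000 argument used above only caps how many documents may go into one such request, so the 90 chunks here fit in a single call.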

🦜🔗 Chaining with LangChain QnA
from functools import lru_cache
from langchain import PromptTemplate
from langchain.chains.question_answering import load_qa_chain
import copy
import functools
import concurrent.futures
prompt_template = ''.join(["{context}", seperator, "{question}"])
PROMPT = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
chain = load_qa_chain(llm=llm_text, chain_type="stuff", prompt=PROMPT, verbose=True)
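The prompt template does not contain any instructions itself. It only joins the retrieved context and the question with a separator, so that the content handler can split them apart again and pass them to the Prompter. The round trip can be sketched in plain Python as follows. The sample strings and the split_prompt helper are made up, and rsplit with a limit of 1 is used here instead of the handler's split so that a separator appearing inside the context would not break the unpacking.

```python
SEPARATOR = "||SPEPERATOR||"
template = "{context}" + SEPARATOR + "{question}"


def split_prompt(formatted: str):
    """Recover (context, question) from a formatted prompt string."""
    context, question = formatted.rsplit(SEPARATOR, 1)
    return context, question


# With retrieved documents, the context holds their text.
with_context = template.format(context="Information: ...", question="How do I apply?")
# With no relevant documents, the context is simply empty.
without_context = template.format(context="", question="Hello")

print(split_prompt(with_context))
print(split_prompt(without_context))
```

An empty context is a valid case: it is what the chain produces when the score filter discards every retrieved document.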
vectro_db = OpenSearchVectorSearch(
    index_name=index_name,
    opensearch_url=opensearch_domain_endpoint,
    embedding_function=llm_emb,
    http_auth=http_auth, # http_auth
    is_aoss =False,
    engine="faiss",
    space_type="l2"
)
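The scores returned by this vector store are not raw distances. Assuming the k-NN plugin scores the l2 space as 1 / (1 + d), where d is the squared Euclidean distance (worth verifying against the OpenSearch documentation for your engine and version), a higher score means a closer match and every score lies between 0 and 1. A small sketch of that conversion, with hypothetical helper names:

```python
from typing import Sequence


def l2_score(a: Sequence[float], b: Sequence[float]) -> float:
    """Score under the assumed rule: 1 / (1 + squared Euclidean distance)."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1.0 / (1.0 + squared_distance)


def squared_distance_from_score(score: float) -> float:
    """Invert the assumed scoring rule to recover the squared distance."""
    return 1.0 / score - 1.0


identical = l2_score([0.0, 0.0], [0.0, 0.0])
apart = l2_score([0.0, 0.0], [3.0, 4.0])
print(identical, apart)
print(squared_distance_from_score(0.007392953))
```

Under this assumption, scores in the 0.003 to 0.007 range correspond to squared distances in the hundreds, which is why the cutoff used in the filter function is such a small number.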
def pretty_print_documents(response):
    for doc, score in response:
        print(f'\nScore: {score}')
        print(f'Document Number: {doc.metadata["row"]}')
        print(f'Source: {doc.metadata["source"]}')

        # Split the page content into lines
        lines = doc.page_content.split("\n")

        # Print each "key: value" line; split only on the first ": " so values that contain ": " stay intact
        for line in lines:
            split_line = line.split(": ", 1)
            if len(split_line) > 1:
                print(f'{split_line[0]}: {split_line[1]}')

        print('-' * 50)
def filter_and_remove_score_opensearch_vector_score(res, cutoff_score = 0.006, variance=0.95):
    # Get the highest score (here a higher score means a closer match)
    highest_score = max(score for doc, score in res)
    print('highest_score : ', highest_score)
    # If even the best score is below the cutoff, treat nothing as relevant and return an empty list
    if highest_score < cutoff_score:
        return []
    # Calculate the lower bound for scores, relative to the best score
    lower_bound = highest_score * variance
    print('lower_bound : ', lower_bound)
    # Keep only documents at or above the lower bound, and drop the scores
    res = [doc for doc, score in res if score >= lower_bound]

    return res
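To see what the two thresholds do in practice, the sketch below re-implements the filter on plain tuples and applies it to the scores from the two test runs later in this post. The document labels are placeholders, the function name is hypothetical, and the guard for an empty result list is an addition that the function above does not have.

```python
def filter_by_relative_score(results, cutoff_score=0.006, variance=0.95):
    """Keep documents whose score is within `variance` of the best score."""
    if not results:  # added safeguard: max() would fail on an empty list
        return []
    highest = max(score for _, score in results)
    if highest < cutoff_score:
        return []
    lower_bound = highest * variance
    return [doc for doc, score in results if score >= lower_bound]


# Scores from the greeting query: the best match is below the 0.006 cutoff.
greeting = [
    ("doc19", 0.0033899078),
    ("doc82", 0.0032900404),
    ("doc49", 0.0032718105),
    ("doc81", 0.0032451365),
]

# Scores from the automatic-transfer query: only the top hit clears 95% of the best score.
autopay = [
    ("doc86", 0.007392953),
    ("doc67", 0.006629911),
    ("doc53", 0.006477051),
    ("doc57", 0.006469134),
    ("doc58", 0.0060997857),
]

print(filter_by_relative_score(greeting))
print(filter_by_relative_score(autopay))
```

This matches the behavior in the tests: small talk reaches the model with no context at all, while the FAQ question is answered from a single document even though k=5 documents were retrieved.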

def get_similiar_docs(query, k=5, fetch_k=300, score=True, bank=""):

    
    #query = f'{bank}, {query}'
    print (query)
    
    if score:
        pre_similar_doc = vectro_db.similarity_search_with_score(
            query,
            k=k,
            fetch_k=fetch_k,
            search_type="approximate_search", # approximate_search, script_scoring, painless_scripting
            space_type="l2",     #"l2", "l1", "linf", "cosinesimil", "innerproduct", "hammingbit";
            pre_filter={"bool": {"filter": {"term": {"text": bank}}}},
            boolean_filter={"bool": {"filter": {"term": {"text": bank}}}}
            #filter=dict(source=bank)
        )
        #print('jhs : ', similar_docs)
        pretty_print_documents( pre_similar_doc)
        similar_docs=filter_and_remove_score_opensearch_vector_score(pre_similar_doc)        
    else:
        similar_docs = vectro_db.similarity_search(
            query,
            k=k,
            search_type="approximate_search", # approximate_search, script_scoring, painless_scripting
            space_type="l2",     #"l2", "l1", "linf", "cosinesimil", "innerproduct", "hammingbit";
            pre_filter={"bool": {"filter": {"term": {"text": bank}}}},
            boolean_filter={"bool": {"filter": {"term": {"text": bank}}}}
            
        )
    similar_docs_copy = copy.deepcopy(similar_docs)
    
    #print('similar_docs_copy : \\n', similar_docs_copy)
    
    return similar_docs_copy

def get_answer(query, bank="",score=False, fetch_k=300, k=1):
                
    search_query = query
    
    similar_docs = get_similiar_docs(search_query, k=k, fetch_k=fetch_k, score=score, bank=bank)
    

    llm_query = '고객 서비스 센터 직원처럼, '+query+' 카테고리에 대한 Information을 찾아서 설명해주세요.'
    
    if not similar_docs:
        llm_query = query

    answer = chain.run(input_documents=similar_docs, question=llm_query)
    
    return answer
Running the test
question ='안녕하세요. 날씨가 참 좋네요.'
response = get_answer(question, bank='신한은행',score=True, k=4)
print("챗봇 : ", response)
Score: 0.0033899078 Document Number: 19 Source: 신한은행 no: 70 Category: 홈페이지상에 제가 등록한 칭찬/불만/제안사항 조회할 수 있나요? Information: 로그인 후 등록한 접수내용에 대해서 확인 가능합니다. type: 홈페이지 Source: 신한은행

Score: 0.0032900404 Document Number: 82 Source: 신한은행 no: 7 Category: 인터넷으로 신규 예/적금 신청하는 방법을 알려주세요 Information: 인터넷상으로 예금/신탁을 신규가입하시려면 우선 고객님께서는인터넷뱅킹에 가입하셔야 하며 신규방법은 두 가지가 있습니다.1. 인터넷뱅킹에서 가입인터넷뱅킹 로그인을 하신 후 예금/신탁 > 신규 메뉴에서 예금 및 신탁 상품을 신규하실 수 있습니다.2. 신한S뱅크에서 가입신한S뱅크 상품센터 > 예금센터 메뉴에서 예금상품을 신규하실 수 있습니다. type: 인터넷뱅킹 Source: 신한은행

Score: 0.0032718105 Document Number: 49 Source: 신한은행 no: 40 Category: 회원탈퇴 후 메일이 계속와요. Information: 인터넷뱅킹가입을 하시면 예금/대출/카드 등 거래에 대한 안내(예:예금만기 등)외에 영업점안내메일 등 몇가지 부가서비스가 기본제공됩니다. 부가서비스는 홈페이지에 로그인하셔서(인터넷뱅킹사용자는 별도 회원가입이 필요없습니다.) 이메일서비스의 수신/거부 등 변경을 하시면 됩니다. 다만, 인터넷뱅킹을 해지하시더라도 예금,카드,대출 등의 거래가 남아있을 수 있기에 메일서비스는 계속 제공됩니다. 따라서 고객님의 경우에는 기존에 제공되는 메일서비스가 계속 남아있어 부가서비스 메일을 받으신 것이며. 정보가 유출되거나 하는 경우는 절대 없으니 안심하시기 바랍니다. 더이상 부가서비스 이메일 수신을 원하지 않는 경우에는 홈페이지의 회원가입을 하신 후 회원서비스의 이메일서비스에 가셔서 변경하시면 됩니다. type: Source: 신한은행

Score: 0.0032451365 Document Number: 81 Source: 신한은행 no: 8 Category: 보안메일서비스 안내해줘 Information: 신한은행은 이메일서비스를 통해 고객님의 거래정보와 금융정보를 메일로 알려드리는데 고객님께 발송되는 이메일 중 개인정보보호가 필요한 메일(거래정보 등)은 암호화 처리되어 보안메일로 발송되어 일반메일 (홍보 및 안내메일) 과 구별됩니다. ※ 입출내역 통지서비스는 개인뱅킹 > 뱅킹보안서비스 > 통지서비스 > 입출내역 Email통지서비스 메뉴에서 서비스 신청 및 변경이 가능합니다. type: 인터넷뱅킹 Source: 신한은행

highest_score : 0.0033899078

> Entering new StuffDocumentsChain chain...
> Entering new LLMChain chain...
Prompt after formatting:
||SPEPERATOR||안녕하세요. 날씨가 참 좋네요.
prompt 아래는 작업을 설명하는 명령어입니다. 요청을 적절히 완료하는 응답을 작성하세요.

명령어:

안녕하세요. 날씨가 참 좋네요.

응답:

> Finished chain.
> Finished chain.
챗봇 : 안녕하세요! 오늘은 무엇을 도와드릴까요?
q ='자동이체 서비스는 어떻게 신청해야 하나요?'
response = get_answer(q, bank='신한은행',score=True, k=5)

print("챗봇 : ", response)
자동이체 서비스는 어떻게 신청해야 하나요?

Score: 0.007392953 Document Number: 86 Source: 신한은행 no: 3 Category: 공과금 자동이체 신청이 가능한가요? Information: 인터넷뱅킹에 로그인하신 후 "공과금/법원 > 공과금센터" 페이지에 가시면 "지로자동이체 등록" 메뉴가 있습니다. 해당 메뉴를 통하여 "전기요금, 전화요금, 국민연금, 국민건강보험료 등을 포함하여 각종 지로요금"을 모두 자동이체 등록하실 수 있습니다. type: 인터넷뱅킹 Source: 신한은행

Score: 0.006629911 Document Number: 67 Source: 신한은행 no: 22 Category: 인터넷 예적금 해약하려면 어떻게 해야 하나요? Information: 인터넷에서 신규하셨고, 이후 통장발급을 받지 않으셨다면 인터넷뱅킹(http://bank.shinhan.com)의 금융상품 예금/신탁 해지 메뉴를 통해 해지하실 수 있습니다. type: Source: 신한은행

Score: 0.006477051 Document Number: 53 Source: 신한은행 no: 36 Category: 이메일서비스 신청 및 해지 방법은? Information: 이메일서비스는 인터넷뱅킹의 서비스메뉴 중 뱅킹보안서비스 > 입출내역통지서비스 > 입출내역 E-Mail통지서비스에서 신청 및 해지하실 수 있습니다. type: 인터넷뱅킹 Source: 신한은행

Score: 0.006469134 Document Number: 57 Source: 신한은행 no: 32 Category: 휴대폰통지서비스 신청 방법은? Information: 휴대폰 통지서비스는 본인의 금융거래내역을 거래발생 즉시 등록된 이동통신 단말기로 통지해 주는 서비스입니다. 휴대폰통지서비스 신청방법 ① 개인뱅킹 로그인 ② 뱅킹보안서비스 ③ 입출내역통지서비스 ④ S알리미 서비스와 입출내역 SMS통지서비스 중 선택 휴대폰통지서비스 특징 S알리미 서비스 type: Source: 신한은행

Score: 0.0060997857 Document Number: 58 Source: 신한은행 no: 31 Category: 보안카드 번호 입력은 어떻게 하면 되죠? Information: 씨크리트(보안)카드는 고객님이 은행에서 신한온라인서비스(인터넷뱅킹)에 가입하신 후 받은 카드입니다. 보안카드번호가 필요한 모든 거래에 기타 궁금하신 내용은 신한은행 고객센터 1599-8000로 문의하여 주시기 바랍니다. type: 인터넷뱅킹 Source: 신한은행

highest_score : 0.007392953
lower_bound : 0.007023305349999999

> Entering new StuffDocumentsChain chain...
> Entering new LLMChain chain...
Prompt after formatting:
no: 3 Category: 공과금 자동이체 신청이 가능한가요? Information: 인터넷뱅킹에 로그인하신 후 "공과금/법원 > 공과금센터" 페이지에 가시면 "지로자동이체 등록" 메뉴가 있습니다. 해당 메뉴를 통하여 "전기요금, 전화요금, 국민연금, 국민건강보험료 등을 포함하여 각종 지로요금"을 모두 자동이체 등록하실 수 있습니다. type: 인터넷뱅킹 Source: 신한은행||SPEPERATOR||고객 서비스 센터 직원처럼, 자동이체 서비스는 어떻게 신청해야 하나요? 카테고리에 대한 Information을 찾아서 설명해주세요.
prompt 아래는 작업을 설명하는 명령어와 추가 컨텍스트를 제공하는 입력이 짝을 이루는 예제입니다. 요청을 적절히 완료하는 응답을 작성하세요.

명령어:

고객 서비스 센터 직원처럼, 자동이체 서비스는 어떻게 신청해야 하나요? 카테고리에 대한 Information을 찾아서 설명해주세요.

입력:

no: 3 Category: 공과금 자동이체 신청이 가능한가요? Information: 인터넷뱅킹에 로그인하신 후 "공과금/법원 > 공과금센터" 페이지에 가시면 "지로자동이체 등록" 메뉴가 있습니다. 해당 메뉴를 통하여 "전기요금, 전화요금, 국민연금, 국민건강보험료 등을 포함하여 각종 지로요금"을 모두 자동이체 등록하실 수 있습니다. type: 인터넷뱅킹 Source: 신한은행

응답:

> Finished chain.
> Finished chain.
챗봇 : 인터넷뱅킹에 로그인하고 "지로자동이체 등록" 메뉴를 찾은 다음, "전기요금, 전화요금, 국민연금, 국민건강보험료 등을 포함한 각종 지로 요금"을 자동이체로 등록할 수 있습니다. 이 정보는 "은행" 카테고리에서 찾을 수 있으며, "인터넷뱅킹" 카테고리에서도 찾을 수 있습니다.

FAQ with OpenSearch - Vector Store Test

This time, instead of a local FAISS index, we use Amazon OpenSearch Service as the vector DB.

ragostest.ipynb

Creating the OpenSearch cluster

First, create the OpenSearch cluster that will serve as the DB.

🔗 Go to Amazon OpenSearch Service and create a domain.

Engine options: OpenSearch_2.9

Network: Public access

Once the domain has been created, go to the Security configuration tab and edit the access policy.

Effect : Deny → Allow

Make a note of the dashboard URL and the domain endpoint.


