
[Python] Content-Based Recommendation (CB) Practice - TMDB 5000 Movie Dataset

xod22 2022. 3. 8. 00:58

2022.01.14 - [Machine Learning | Deep Learning/Recommender Systems] - [K-Data x Learning Spoons] 2-1. Content-Based Recommendation (CB), TF-IDF

 


I covered the theory behind CB in the earlier post, so this time I'm going to practice content-based recommendation (CB) hands-on with real data!


CB(Content-based Recommendation)

: Before the practice, a quick recap of CB (Content-based Recommendation):

It takes the metadata of items that user A preferred in the past and recommends similar items to user A.

 

* The similarity between items is measured through their item profile vectors,

and the items with the highest similarity are the ones that get recommended!
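
To make the idea of "item profile vectors" concrete, here is a minimal sketch. The three movies, their genre flags, and the helper name `cosine` are all made up for illustration and are not part of the TMDB data; it only shows how a similarity score between two profile vectors can be computed.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical item profiles; columns are [Action, Crime, Drama, Comedy]
profiles = {
    "movie_a": [1, 1, 0, 0],
    "movie_b": [0, 1, 1, 0],
    "movie_c": [0, 0, 0, 1],
}

liked = "movie_a"  # the item the user preferred in the past
scores = {
    name: cosine(profiles[liked], vec)
    for name, vec in profiles.items()
    if name != liked
}
best = max(scores, key=scores.get)  # the most similar remaining item
print(best, scores[best])
```

Here movie_b shares one genre with movie_a while movie_c shares none, so movie_b is the one that would be recommended.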

 

 

Practice - Method 1

 

1. Importing packages and loading the data

https://www.kaggle.com/tmdb/tmdb-movie-metadata

 


Download the tmdb_5000_movies.csv file from the Kaggle link.

import pandas as pd
import numpy as np
import warnings; warnings.filterwarnings('ignore')
movies=pd.read_csv('tmdb_5000_movies.csv')
print(movies.shape)
movies.head(1)

-> The data consists of 4,803 records and 20 features.

 

 

~Keep only the columns we need~

Content-based filtering recommends other movies that share characteristics, attributes, or components with a movie the user likes.

So we will use only the id, title, genres, vote_average (rating), vote_count, popularity, keywords, and overview columns.

# .copy() makes movies_df an independent DataFrame, so the column assignments below do not write into a slice of movies
movies_df=movies[['id','title', 'genres', 'vote_average', 'vote_count', 'popularity', 'keywords', 'overview']].copy()

 

 

~Check the format of the genres and keywords columns~

movies_df[['genres','keywords']]

Each value is stored as a string that looks like a list of dictionaries.

 

 

~Look at a single row~

: Let's widen the column display so more of each value is shown, and print just one row!

pd.set_option('display.max_colwidth', 100)
# print just one row
movies_df[['genres', 'keywords']][:1]

The name of each individual genre can be extracted with the dictionary key "name".

 

 

~Parse the strings in the genres column -> extract the individual genres as Python list objects~

: After literal_eval is applied, the genres and keywords columns are no longer strings but list objects that contain several dictionaries.

from ast import literal_eval

movies_df['genres']=movies_df['genres'].apply(literal_eval)
movies_df['keywords']=movies_df['keywords'].apply(literal_eval)

 

# From each column, keep only the names (e.g. 'Action', 'Adventure') as a list

movies_df['genres']=movies_df['genres'].apply(lambda x : [y['name'] for y in x])
movies_df['keywords']=movies_df['keywords'].apply(lambda x : [y['name'] for y in x])

# check
movies_df[['genres', 'keywords']][:1]

We can confirm that the names were extracted correctly.
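
The two-step conversion above can be tried on a single value without loading the CSV. This is a small sketch; the `raw` string is a hand-written sample in the same shape as the column values, not a row copied from the dataset.

```python
from ast import literal_eval

# Hand-written sample in the same "list of dicts as a string" shape
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

parsed = literal_eval(raw)          # str -> real Python list of dicts
names = [d["name"] for d in parsed]  # keep only the "name" values

print(type(raw).__name__, "->", type(parsed).__name__)
print(names)
```

literal_eval only evaluates Python literals (lists, dicts, strings, numbers, and so on), which makes it a safer choice than eval for parsing data read from a file.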

 


2. CountVectorizer on the genre values

 

# CB: compare similarity using the genre values, then recommend the movies with high ratings
# Convert the genres column to a string, vectorize it with CountVectorizer, and apply cosine similarity to the resulting matrix to judge how similar the movies are

from sklearn.feature_extraction.text import CountVectorizer

# CountVectorizer expects text, so join each genre list into one string with words separated by spaces
movies_df['genres_literal']=movies_df['genres'].apply(lambda x : (' ').join(x))

 

 

~Check the conversion~

# check the conversion
print(movies_df[['genres']][:1])
print(movies_df[['genres_literal']][:1])

The conversion worked as expected.

 

 

~Apply CountVectorizer~

# Create a CountVectorizer object named count_vect
# (min_df=1 is the default; recent scikit-learn versions reject the integer 0 for min_df)
count_vect=CountVectorizer(min_df=1, ngram_range=(1,2))
# min_df: minimum document frequency a term needs in order to be included in the vocabulary

# Build the count matrix from the 'genres_literal' column
genre_mat=count_vect.fit_transform(movies_df['genres_literal'])
print(genre_mat.shape)

: Since the data had 4,803 rows, the result is a feature matrix with 4,803 records and 276 word (unigram and bigram) features.

 


3. Cosine similarity of the genre values

from sklearn.metrics.pairwise import cosine_similarity
genre_sim=cosine_similarity(genre_mat, genre_mat)
print(genre_sim.shape)

# look at just the first 2 rows
print(genre_sim[:2])

 

 

~Extract the indices with the highest similarity~

: For each movie, get the positional indices of the other movies ordered from highest to lowest similarity.

genre_sim_sorted_ind=genre_sim.argsort()[:, ::-1]

print(genre_sim_sorted_ind[:1])

argsort() sorts in ascending order, so ::-1 reverses each row to give descending order.

Looking at the first row only, the indices are listed in order of decreasing similarity: 0 (the movie itself) -> row 3494 -> row 813 ... and so on.
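
The argsort-then-reverse step is easier to follow on a tiny matrix. The 3x3 similarity values below are invented for illustration only.

```python
import numpy as np

# Hypothetical 3x3 similarity matrix (1.0 on the diagonal = each item vs itself)
sim = np.array([
    [1.0, 0.2, 0.8],
    [0.2, 1.0, 0.5],
    [0.8, 0.5, 1.0],
])

ascending = sim.argsort()        # per row: indices from least to most similar
descending = ascending[:, ::-1]  # reverse every row: most similar first

print(descending)
```

Each row now starts with the item's own index, followed by its nearest neighbour. One caveat: when several similarity values in a row are tied, argsort makes no promise about the order among them, so on real data the movie itself is not guaranteed to appear in the first position.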

 


4. Recommending movies by genre similarity

# Define find_sim_movie(), a function that recommends movies by genre similarity
def find_sim_movie(df, sorted_ind, title_name, top_n=10):
    
    # From the DataFrame passed in (movies_df), keep only the rows whose 'title' column equals title_name
    title_movie = df[df['title'] == title_name]
    
    # Convert the index of the matching rows to an ndarray -> the position of the movie in the data
    # From sorted_ind (here genre_sim_sorted_ind), take the first top_n indices in order of similarity
    title_index = title_movie.index.values
    similar_indexes = sorted_ind[title_index, :(top_n)]
    
    # Print the extracted top_n indices. At this point they form a 2-D array
    print(similar_indexes)
    # Flatten to a 1-D array so it can be used to index the DataFrame
    similar_indexes = similar_indexes.reshape(-1)
    
    # Return the rows of the original df at those positions
    return df.iloc[similar_indexes]

 

Using the find_sim_movie() function defined above, let's recommend 10 movies whose genres are similar to 'The Godfather'.

similar_movies = find_sim_movie(movies_df, genre_sim_sorted_ind, 'The Godfather',10)
similar_movies[['title', 'vote_average']]

This returns the recommendation list.


In the results, 'The Godfather: Part II' is recommended first.

However, the list also contains movies such as 'Light Sleeper', 'Mi America', and 'Kids' that are hard to recommend to someone who likes The Godfather.

 

'Light Sleeper' has a fairly low rating, and 'Mi America' has a rating of 0!

 

To improve these results, let's change the approach: select a larger pool of candidates first, then filter them by rating to produce the final recommendations.


Practice - Method 2 (with weighted ratings)

 

: Using the same data as in Method 1, we will select a larger pool of candidates and then filter them by rating!

 

1. Checking the data

# vote_average : the movie's average rating (0 to 10)

# vote_count : number of ratings

movies_df[['title', 'vote_average','vote_count']].sort_values('vote_average', ascending=False)[:10]
# sort by rating, highest first (descending)

Movies with very few ratings turn out to sit at the top of the list.

To avoid this kind of distorted rating data, we use a weighted rating instead!

 


2. Replacing the raw rating with a weighted rating!

 

- v (vote_count) : number of votes cast for the individual movie

- m : minimum number of votes required for a movie to be rated

- R (vote_average) : average rating of the individual movie

- C (vote_average.mean()) : average rating across all movies
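
The formula these four symbols plug into does not appear in the text here. Reconstructed from the weighted_vote_average() function used later in this post, it can be written as:

```latex
\[
\text{weighted\_vote} \;=\; \frac{v}{v+m}\,R \;+\; \frac{m}{v+m}\,C
\]
```

The two weights always sum to 1. A movie with no votes (v = 0) gets exactly C, and as v grows far beyond m the result approaches the movie's own average R.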

 

Here m controls how strongly the number of votes affects the result.

Raising m pulls movies with few votes closer to the overall average C, so movies with many votes end up relatively favored.

* In other words, the more people have rated a movie, the more its own average is trusted.

C=movies_df['vote_average'].mean()
# m is the 60th percentile of vote_count (60% of the movies have this many votes or fewer)
m=movies_df['vote_count'].quantile(0.6)

print('C :', round(C,3), 'm: ', round(m,3))

 

~Function that converts a rating into a weighted rating~

def weighted_vote_average(record):
    v=record['vote_count']
    R=record['vote_average']
    
    return ((v/(v+m))*R)+((m/(m+v))*C)
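
A quick worked example shows what the weighting does. The values of `C_demo` and `m_demo` and the two movies below are hypothetical round numbers chosen for illustration, not values computed from the dataset, and `weighted_rating` is just a standalone copy of the same formula.

```python
def weighted_rating(v, R, m, C):
    """Blend a movie's own average R with the global average C by vote count v."""
    return (v / (v + m)) * R + (m / (v + m)) * C

C_demo = 6.0   # hypothetical global average rating
m_demo = 400   # hypothetical vote-count threshold

# A movie with a very high average but only 10 votes
few_votes = weighted_rating(v=10, R=9.5, m=m_demo, C=C_demo)
# A movie with a slightly lower average but 10,000 votes
many_votes = weighted_rating(v=10_000, R=8.4, m=m_demo, C=C_demo)

print(round(few_votes, 2), round(many_votes, 2))
```

The 10-vote movie drops from 9.5 to roughly 6.09, barely above the global average, while the 10,000-vote movie only moves from 8.4 to roughly 8.31, so the well-rated, widely-rated movie now ranks higher.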

 

~Apply the function~

: The weighted ratings are stored in a new 'weighted_vote' column.

movies_df['weighted_vote']=movies_df.apply(weighted_vote_average, axis=1)

 

Let's print the top 10 movies ordered by the new weighted_vote rating!

movies_df[['title', 'vote_average', 'weighted_vote', 'vote_count']].sort_values('weighted_vote', ascending=False)[:10]

Movies rated by many people are now at the top!

 


3. Take twice top_n movies by genre similarity -> keep the ones with the highest weighted_vote

 

~Define the function~

# Define a new version of the function that takes the weighted rating into account
def find_sim_movie(df, sorted_ind, title_name, top_n=10):
    title_movie=df[df['title']==title_name]
    title_index=title_movie.index.values
    
    # Take the indices of the top_n*2 movies with the highest genre similarity
    similar_indexes=sorted_ind[title_index, :(top_n*2)]
    similar_indexes=similar_indexes.reshape(-1)
    # Exclude the index of the reference movie itself
    similar_indexes=similar_indexes[similar_indexes !=title_index]
    
    # From the top_n*2 candidates, return the top_n with the highest weighted_vote
    return df.iloc[similar_indexes].sort_values('weighted_vote', ascending=False)[:top_n]
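
The same three steps (over-fetch candidates, drop the reference movie, re-rank by weighted rating) can be traced by hand on toy data. Everything below, including the `toy` frame and the `toy_sorted_ind` ranking, is invented for illustration and only mirrors the logic of the function.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: five movies with precomputed weighted ratings
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "weighted_vote": [8.0, 5.0, 7.5, 6.0, 7.0],
})
# Hypothetical neighbour ranking for movie "A" (row 0), most similar first
toy_sorted_ind = np.array([[0, 1, 3, 2, 4]])

top_n = 2
title_index = toy[toy["title"] == "A"].index.values                # array([0])
candidates = toy_sorted_ind[title_index, :top_n * 2].reshape(-1)   # first 4 neighbours
candidates = candidates[candidates != title_index]                 # drop "A" itself

# Re-rank the remaining candidates by weighted rating and keep top_n
result = toy.iloc[candidates].sort_values("weighted_vote", ascending=False)[:top_n]
print(result["title"].tolist())
```

By similarity alone the first two picks would have been B and D. After re-ranking, the low-rated B drops out in favour of C. Note also that removing the reference movie leaves one candidate fewer than top_n*2.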

 

~Apply the function~

similar_movies=find_sim_movie(movies_df, genre_sim_sorted_ind, 'The Godfather', 10)
similar_movies[['title', 'vote_average', 'weighted_vote']]

The recommendations are much better than the previous ones!
