All Forums Aster
sastd 1 post Joined 05/15
12 Oct 2015
Aster Association Analysis - CFilter Example

CFilter

What Is It?
This function performs collaborative filtering by using a series of SQL commands and SQL-MapReduce functions. You run this function by using an internal JDBC wrapper function.

Usage

SELECT * FROM CFilter (
ON (SELECT 1)
PARTITION BY 1
[DOMAIN('host:port')]
[DATABASE('db_name')]
[USERID('user_id')]
[PASSWORD('password')]
[SSLSETTINGS('SSLsettings')]
[SSLTRUSTSTOREPASSWORD('SSLtruststorepassword')]
INPUTTABLE('input_table_name')
OUTPUTTABLE('output_table_name')
INPUTCOLUMNS('source_column1', 'source_column2',...)
JOINCOLUMNS('join_column1', 'join_column2',...)
[OTHERCOLUMNS('other_column1', 'other_column2',...)]
[PARTITIONKEYCOLUMN('partitionKeyColumn1')]
[MAXSET('max_item_set')]
[DROPTABLE('true'|'false')]
);

Argument

Required?

Description

INPUTTABLE

Yes

Name of the input table containing the data to be filtered.

OUTPUTTABLE

Yes

Name of the output table into which the function writes the final results. If the output table already exists, then you should also pass the DROPTABLE (‘true’) argument, to drop it before writing the new results. Otherwise, an exception is thrown.

INPUTCOLUMNS

Yes

A list of input columns to filter. The column names are single-quoted and written in the comma-delimited format <'col1', 'col2', ...>.

JOINCOLUMNS

Yes

A list of columns to join on. The column names are single-quoted and written in the comma-delimited format <'col1', 'col2', ...>.

OTHERCOLUMNS

No

A comma-separated list of the names of other input columns to add to the output. These columns pass through the function unchanged. The column names should be singlequoted (<'col1', 'col2', ...>). The output is partitioned based on the columns specified by this argument. If this argument is not specified, the input data is regarded as belonging to one partition.

PARTITIONKEY

No

Single column used as partition key for the newly created output table. Default partitionKey is col1_item1.

MAXSET

No

Size of the maximum item set to be considered. Default is 100.

DROPTABLE

No

If you set this option to true, if the output table name already exists, the function drops the table. Default value is false.

Notes

  1. Specifying the same column in the OTHERCOLUMNS and JOINCOLUMNS arguments results in an incorrect counting in the groups.

Input Data
Data for the following example is based on listening habits prior to 2009 of last.fm users from http://ocelma.net/MusicRecommendationDataset/lastfm- 360K.html, courtesy of last.fm.
The original data set contained the listening data for 360 thousand users, amounting to almost 18 million rows (1.52 GB in size).
Therefore, the data set below is a cleaned subset of the original data set (1 thousand users and about 100 thousand rows) to save space and make it easier to work with the data.
Data: thousand_users.csv
Loading the Data: data_loading.sql

Run Time Analysis

Number of Rows

Run Time

98542

00:00:40.014

Example
First, we'll get the top scores from CFilter for all artists:

SELECT * FROM CFilter (
ON (SELECT 1)
PARTITION BY 1
INPUTTABLE('ja186045.onekusers')
OUTPUTTABLE('ja186045.onekuserscfilter')
INPUTCOLUMNS('artist')
JOINCOLUMNS('user_id')
DROPTABLE('true')
);

SELECT * FROM ja186045.onekuserscfilter
ORDER BY score DESC
LIMIT 5;

col1_item1

col1_item2

cntb

cnt1

cnt2

score

support

confidence

lift

z_score

deolinda

coimbra

1

1

1

1

0.001

1

1000

-0.217661878

hazel

spoonboy

1

1

1

1

0.001

1

1000

-0.217661878

downy

triceratops

1

1

1

1

0.001

1

1000

-0.217661878

tua

chaoze one

1

1

1

1

0.001

1

1000

-0.217661878

run devil run

all out war

1

1

1

1

0.001

1

1000

-0.217661878

As can be seen, the results don't provide much value as the artist pairings with the highest scores are those that were listened to the least.
Let's look at a couple examples for more popular artists:

SELECT * FROM ja186045.onekuserscfilter
WHERE col1_item1 = 'kanye west' or col1_item2 = 'kanye west'
ORDER BY score DESC
LIMIT 4;

col1_item1

col1_item2

cntb

cnt1

cnt2

score

support

confidence

lift

z_score

kanye west

jay-z

20

80

34

0.147058824

0.02

0.25

7.352941176

18.88348159

jay-z

kanye west

20

34

80

0.147058824

0.02

0.588235294

7.352941176

18.88348159

the game

kanye west

13

16

80

0.13203125

0.013

0.8125

10.15625

11.84621821

kanye west

the game

13

80

16

0.13203125

0.013

0.1625

10.15625

11.84621821

Since we looked at Kanye, we might as well look at Taylor Swift too because of their history.

SELECT * FROM ja186045.onekuserscfilter
WHERE col1_item1 = 'taylor swift' or col1_item2 = 'taylor swift'
ORDER BY score DESC
LIMIT 4;

col1_item1

col1_item2

cntb

cnt1

cnt2

score

support

confidence

lift

z_score

taylor swift

carrie underwood

4

10

5

0.32

0.004

0.4

80

2.798308144

carrie underwood

taylor swift

4

5

10

0.32

0.004

0.8

80

2.798308144

kellie pickler

taylor swift

2

2

10

0.2

0.002

1

100

0.787661463

taylor swift

reba mcentire

2

10

2

0.2

0.002

0.2

100

0.787661463

The results from these two queries show that users typically listen to a particular genre when it comes to these artists. People who listen to Kanye West also listen to other rappers such as Jay-Z or The Game and those who listen to Taylor Swift also listen to other country artists such as Carrie Underwood, Kellie Pickler, and Reba McEntire. If this was performed on a more recent data set, the results would probably be different as Taylor Swift has transitioned from a country singer to a pop artist.
The output from this function could be used to create a recommendation engine, like the ones used by streaming services such as last.fm, Pandora, and Spotify. By finding users that have similar tastes in music, this system could provide artist suggestions that the user may like based on their past listening history.
 

 

You must sign in to leave a comment.