11
Aug
11

meeting 10/08/2011

In the meeting we discussed the current status of our TUBITAK report. We have finalized our tests and are now writing the report. We also discussed the paper: Hearst, M.A.: TextTiling: Segmenting text into multi-paragraph subtopic passages. Computational Linguistics 23(1) (1997) 33-64.

At the end of the meeting we took the following decisions:

  • Proceeding with the TUBITAK report.
28
Jul
11

meeting 27/07/2011

In the meeting we revised Anil’s research documentation, which describes a new approach to search result clustering and labeling.

At the end of the meeting we took the following decisions:

  • Running experiments on the divan data for the upcoming TUBITAK report.
  • Documenting, in Turkish, the experimental results on Evliya Celebi’s Seyahatname.
22
Jul
11

meeting 20/07/2011

In the meeting we rehearsed the presentation of our plagiarism method (P^2CD). We also talked about Bilge’s diversification process.

At the end of the meeting we took the following decisions:

  • Completing our plagiarism method documentation.
  • Proceeding with the diversification process by implementing a PHP interface.
14
Jul
11

meeting 13/07/2011

In the meeting we talked about the latest status of P^2CD, our external plagiarism and parallel corpora detection algorithm. We also talked about the details of implementing a language model within the scope of information retrieval systems.

For the next week, we took the following decisions:

  • Proceeding with the documentation of P^2CD.
08
Jul
11

meeting 07/07/2011

In the meeting we talked about our plagiarism detection algorithm (P^2CD) and its experimental results. The algorithm gives competitive results over the raw plagiarism corpus. Paired t-test results show that the Levenshtein metric has a better overall result for this dataset. However, our method is bilingual: it is designed not only for plagiarism detection but also for parallel corpora detection, which cannot be accomplished with Levenshtein distance since that is based on monolingual string comparisons.
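
To make the point about monolingual string comparison concrete, here is a minimal sketch of the standard Levenshtein (edit) distance in Python. It is the textbook dynamic-programming formulation, not the code used in our experiments, and the function name `levenshtein` is only illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the empty prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]
```

Because the distance only counts character edits, a sentence and its translation into another language usually look almost entirely different to it, which is why it does not apply to parallel corpora detection.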

30
Jun
11

meeting 29/06/2011

In the meeting we talked about Bilge’s diversification research progress as well as our plagiarism detection method and Anil’s summarization process. According to the test results, our external plagiarism and parallel corpora detection algorithm seems to work only in non-obfuscated plagiarism cases (raw plagiarism).

For the next meeting we took the following decisions:

  • Proceeding with tests of our plagiarism and parallel corpora detection algorithm.

22
Jun
11

meeting 22/06/2011

In the meeting we talked about the latest status of our external plagiarism and parallel corpora detection algorithm. According to the results, our method seems successful when the plagiarism doesn’t contain obfuscation. We also talked about Anil’s summarization process and Bilge’s diversification research progress.

At the end of the meeting we took the following decisions:

  • Continuing the plagiarism detection tests.
16
Jun
11

meeting – 15/06/2011

In the meeting we talked about the latest status of the plagiarism detection algorithm as well as Anil’s summarization process. Currently, near duplicate news detection with our plagiarism detection algorithm runs too slowly when the sliding window size is taken into account.

At the end of the meeting we took the following decisions:

  • Improving the algorithm for near duplicate news detection with sliding window sizes.
  • Proceeding with the plagiarism detection tests.
11
Jun
11

meeting – 08/06/2011

In the meeting we talked about Anil’s summarization process and our proposed external plagiarism and parallel corpora detection algorithm. We started using the plagiarism detection algorithm over our Bilkent Information Retrieval Group near duplicate dataset.

Currently the algorithm gives poor results over this dataset. This is because of the high level of obfuscation, because we ignored stepsize while adapting our algorithm to the near duplicate dataset, and because the ground truth information is inadequate.

For the next meeting we took the following decisions:

  • Continuing the near duplicate detection tests, taking stepsize in blocks into account.
03
Jun
11

meeting – 01/06/2011

We talked about the latest progress in our plagiarism detection approach. We searched for the best configuration over a test dataset, and according to the results it is: blocksize: 300 words, stepsize: 3 words, documentsize: 10 words, clusterthreshold (difference percentage between the number of clusters of compared blocks): 20%, Yao difference threshold (difference between the actual cluster distribution average and the Yao (random) distribution): 20, consecutiveness threshold: 2, and pair threshold: 3.
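
As an illustration of what blocksize and stepsize mean for a sliding window over words, here is a minimal Python sketch. It only shows how overlapping blocks could be cut from a word list; the function name `sliding_blocks` and the handling of short texts and trailing words are assumptions for this example, not a description of the actual P^2CD implementation.

```python
from typing import Iterator, List


def sliding_blocks(words: List[str], block_size: int = 300,
                   step_size: int = 3) -> Iterator[List[str]]:
    """Yield overlapping blocks of block_size words, moving the window
    forward by step_size words each time."""
    if block_size <= 0 or step_size <= 0:
        raise ValueError("block_size and step_size must be positive")
    if not words:
        return
    # Assumption: a text shorter than one block is returned as a single block.
    if len(words) <= block_size:
        yield list(words)
        return
    # Assumption: trailing words that cannot start a full block are not
    # covered by an extra, shorter block.
    for start in range(0, len(words) - block_size + 1, step_size):
        yield words[start:start + block_size]


# Small illustrative values instead of 300 and 3, to keep the output readable.
sample = ["w%d" % i for i in range(10)]
for block in sliding_blocks(sample, block_size=4, step_size=3):
    print(block)
```

With a small stepsize relative to blocksize, consecutive blocks overlap heavily, so the number of blocks to compare grows quickly with document length.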

At the end of the meeting we took the following decisions:

  • Continuing the tests over 3 different corpora which include plagiarism cases with different obfuscation levels (none, low and high).


