User talk:Мишоко

From Wikisource.

Welcome

Welcome to Wikisource!

Hello, Мишоко, and welcome to Wikisource, the free library!

If you wish to collaborate, please read "What Wikisource is", our guidelines, and our how-to help.

If you have any questions or doubts, don't hesitate to ask at our Village Pump, or ask an administrator or any user currently online, whom you can find by following the recent changes.

Happy editing and have fun, from all the Wikilibrarians!

A warm welcome from me too, of course; if you need anything, don't hesitate to contact me.

εΔω 13:43, 11 Dec 2022 (CET)

How did you find all these errors?

Hi Мишоко, and thank you so much for Utente:Мишоко/The so-called riletta pages are not so riletta. Obviously our proofreading system is far from perfect, but thanks to you we are now fixing all the mistakes we previously missed. I'd like to know more about the procedure you used to find them: can you share some technical details? It would be great if your procedure could run automatically on a regular basis, daily or at least weekly, and write its results to a project page that we can check. Please let me know! Thank you. Can da Lua (talk) 12:57, 20 Dec 2022 (CET)

Hello. Do you understand this query? At the French Wikisource there are people who correct hundreds or thousands of pages every month with queries such as this one.
If you want to do something more sophisticated, there are dumps of the Wikimedia projects that are made available twice a month, for example here. You can download the one called "All pages, current versions only." That way you can run queries that do not have the limitations of the Wikisource search engine, for example:
$ grep -E "{{Ec\|([a-z]+)\|\1}}" it*
{{Ec|diascheuasti|diascheuasti}} etc.
$ grep -Ev " dei (loro|molto|meno|.*quattro|.*cento) " it* | grep -E " dei [a-z]+o "
casa dei suo amico etc.
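For anyone who would rather run the same two checks from a script, here is a minimal Python sketch of the grep queries above. The function names and the sample lines are made up for illustration; only the regular expressions come from the commands quoted here.

```python
import re

# First grep: {{Ec|word|word}} where the "wrong" and "corrected" forms are
# identical, so the template corrects nothing.
EC_NOOP = re.compile(r"\{\{Ec\|([a-z]+)\|\1\}\}")

# Second grep, step 1: lines to drop (the `grep -v` filter).
DEI_EXCLUDE = re.compile(r" dei (loro|molto|meno|.*quattro|.*cento) ")
# Second grep, step 2: " dei " followed by a lowercase word ending in -o.
DEI_SINGULAR = re.compile(r" dei [a-z]+o ")


def find_noop_ec(lines):
    """Return every Ec template whose two arguments are the same word."""
    return [m.group(0) for line in lines for m in EC_NOOP.finditer(line)]


def find_dei_singular(lines):
    """Return lines that pass the exclusion filter and match ' dei ...o '."""
    return [
        line
        for line in lines
        if not DEI_EXCLUDE.search(line) and DEI_SINGULAR.search(line)
    ]


if __name__ == "__main__":
    sample = [
        "testo {{Ec|diascheuasti|diascheuasti}} testo",
        "la casa dei suo amico era",
        "uno dei loro amici",
    ]
    print(find_noop_ec(sample))
    print(find_dei_singular(sample))
```

Like the grep version, this works line by line, so it can be fed a dump file object directly instead of a list.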
I've tried a different approach and written scripts that don't require much knowledge of the language, which is why I can now run them here even though they were originally built for the French Wikisource. It's all very basic. The only major issue is performance: the French dump is 12 GB, and a 12 GB file is not easy to process, especially if you want to "go through all the pages and memorize all those names that don't appear in the dictionary".
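To illustrate the performance point only (this is an assumption about one workable setup, not a description of the scripts mentioned above): a compressed dump can be streamed line by line so that memory use stays flat regardless of file size. The function name `count_words` and the dump path are hypothetical.

```python
import bz2
import re
from collections import Counter

# Runs of Unicode letters; digits, underscores and punctuation are skipped.
WORD = re.compile(r"[^\W\d_]+")


def count_words(dump_path):
    """Stream a .bz2 dump and count lowercase word frequencies."""
    counts = Counter()
    # bz2.open in text mode decompresses on the fly; only one line is
    # held in memory at a time.
    with bz2.open(dump_path, "rt", encoding="utf-8") as dump:
        for line in dump:
            counts.update(w.lower() for w in WORD.findall(line))
    return counts


if __name__ == "__main__":
    # Hypothetical file name; substitute the dump you actually downloaded.
    frequencies = count_words("itwikisource-pages-meta-current.xml.bz2")
    print(frequencies.most_common(20))
```

The counter itself still grows with the number of distinct words, so on a very large dump it may be worth pruning or writing intermediate results to disk.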
I'm not planning to create a procedure that runs regularly. I will stop when I reach 1000 "errors" and then I will shift to some other language, probably Polish or Dutch.
Have you heard of dicompte.toolforge.org? It was really a brilliant idea, but I think whoever built it left long ago and the data hasn't been refreshed for years. Мишоко (talk) 00:57, 21 Dec 2022 (CET)
Yes, sometimes I do queries like that, but your approach looks much more efficient... In 500 results so far, we haven't found a single false positive. That's truly impressive, and I still don't understand how you managed to do it. How do you distinguish an archaic spelling from a real typo? A vocabulary can hardly cover all the cases. Do you check word frequency to accept words that appear often? Does your script generate a list that you then check manually to remove false positives? Can da Lua (talk) 11:16, 21 Dec 2022 (CET)
I don't think my approach is more efficient, at least not if you know the language. It's just different, so I may not catch the same fish. I am definitely getting loads of false positives. Most of your questions show that you and I are looking at things in totally opposite ways:
  • "A vocabulary can hardly cover all the cases." You don't need to cover all the cases if you're just targeting 1,000 errors and there are so many, many more.
  • "Do you check word frequency to accept words that appear often?" The idea is not to accept words that are properly spelt, it's to reject a few words that look highly suspicious - and there will be false positives.
  • "How do you distinguish an archaic spelling from a real typo?" Potentially, the real typo may look highly suspicious while the archaic spelling does not.
I do not need a "vocabolario" that will take weeks to build and will be rendered useless once I've moved to the Polish Wikisource.
What I do need is a variety of criteria that I can feed my computer with so it can identify highly suspicious words and output them in context. I would call this my secret recipe but clearly, you don't need to win MasterChef to know why the word "matnmonio" is highly suspicious:
Like I said, performance is the only issue. Мишоко (talk) 14:58, 22 Dec 2022 (CET)
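The actual criteria are not disclosed in this thread. Purely as an illustration of the "reject what looks highly suspicious" idea, here is one language-agnostic heuristic: flag rare words that contain a letter trigram never seen in any frequent word. The function names, the `trusted_min` threshold, and the sample counts are all assumptions made for this sketch.

```python
def trigrams(word):
    """Letter trigrams of a word, with ^ and $ marking its boundaries."""
    padded = f"^{word}$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def flag_suspicious(word_counts, trusted_min=5):
    """Map each rare word to the trigrams it shares with no frequent word.

    word_counts: mapping of word -> number of occurrences in the corpus.
    Words seen at least `trusted_min` times are treated as correctly spelt
    and define the set of known trigrams.
    """
    trusted = {w for w, n in word_counts.items() if n >= trusted_min}
    known = {t for w in trusted for t in trigrams(w)}
    flagged = {}
    for word in word_counts:
        if word in trusted:
            continue
        unseen = [t for t in trigrams(word) if t not in known]
        if unseen:
            flagged[word] = unseen
    return flagged


if __name__ == "__main__":
    # Made-up counts for illustration.
    counts = {"matrimonio": 40, "patrimonio": 12, "monte": 9, "matnmonio": 1}
    print(flag_suspicious(counts))
```

With these made-up counts, "matnmonio" is flagged because of "atn", "tnm" and "nmo". A rare archaic spelling can trip the same check, so false positives are expected and the output still needs to be read in context.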